Take-Home Assignment: Real-World Big Data Analysis¶

216036B - Fonseka K.N.N

Importing necessary Libraries¶

In [1]:
# Install required packages for big data processing
!pip install pyspark
!pip install plotly
!pip install seaborn
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install imbalanced-learn
!pip install gdown
Requirement already satisfied: pyspark in /usr/local/lib/python3.12/dist-packages (3.5.1)
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.12/dist-packages (from pyspark) (0.10.9.7)
Requirement already satisfied: plotly in /usr/local/lib/python3.12/dist-packages (5.24.1)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.12/dist-packages (from plotly) (8.5.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from plotly) (25.0)
Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /usr/local/lib/python3.12/dist-packages (from seaborn) (2.0.2)
Requirement already satisfied: pandas>=1.2 in /usr/local/lib/python3.12/dist-packages (from seaborn) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /usr/local/lib/python3.12/dist-packages (from seaborn) (3.10.0)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.3)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.59.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.9)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (25.0)
Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.12/dist-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.2->seaborn) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas>=1.2->seaborn) (2025.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.12/dist-packages (1.6.1)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (2.0.2)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (1.16.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (3.6.0)
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.12/dist-packages (0.14.0)
Requirement already satisfied: numpy<3,>=1.25.2 in /usr/local/lib/python3.12/dist-packages (from imbalanced-learn) (2.0.2)
Requirement already satisfied: scipy<2,>=1.11.4 in /usr/local/lib/python3.12/dist-packages (from imbalanced-learn) (1.16.1)
Requirement already satisfied: scikit-learn<2,>=1.4.2 in /usr/local/lib/python3.12/dist-packages (from imbalanced-learn) (1.6.1)
Requirement already satisfied: joblib<2,>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from imbalanced-learn) (1.5.1)
Requirement already satisfied: threadpoolctl<4,>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from imbalanced-learn) (3.6.0)
Requirement already satisfied: gdown in /usr/local/lib/python3.12/dist-packages (5.2.0)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.12/dist-packages (from gdown) (4.13.4)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from gdown) (3.19.1)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.12/dist-packages (from gdown) (2.32.4)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from gdown) (4.67.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4->gdown) (2.7)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.12/dist-packages (from beautifulsoup4->gdown) (4.14.1)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests[socks]->gdown) (3.4.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests[socks]->gdown) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests[socks]->gdown) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests[socks]->gdown) (2025.8.3)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.12/dist-packages (from requests[socks]->gdown) (1.7.1)
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

import gdown
import os
import time
from datetime import datetime
import psutil
from pyspark.sql import SparkSession
%matplotlib inline
import seaborn as sns
import folium
import warnings
warnings.filterwarnings('ignore')
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from matplotlib import cm
from matplotlib.colors import to_hex
from io import BytesIO
import base64
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA
from prophet import Prophet
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.stats import linregress
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# PySpark imports for big data processing
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import min as spark_min, max as spark_max, count, col
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml import Pipeline
from pyspark.sql.functions import col, count
from pyspark.ml.feature import Bucketizer
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, when
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.types import DoubleType, IntegerType
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import LogisticRegression
In [3]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("GTD Big Data Analysis") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

To carry out this comprehensive big data analytics assignment, I imported a wide range of essential libraries that support data manipulation, preprocessing, visualization, machine learning, clustering, forecasting, and natural language processing. Core libraries like pandas, numpy, matplotlib, and seaborn were used for data handling, numerical computations, and statistical visualizations. Big data tools such as pyspark was utilized to efficiently manage and process large-scale datasets. For advanced visualizations and interactivity, libraries like plotly, folium, and missingno were employed to create insightful charts, geographic maps, and missing value patterns. Clustering and dimensionality reduction were performed using K-Means, and PCA, while time series forecasting was implemented using Facebook's Prophet. For textual analysis, libraries like CountVectorizer, wordcloud, and PIL enabled the extraction and representation of key themes from incident summaries. Machine learning models such as RandomForestClassifier, GradientBoostingClassifier, and LogisticRegression were applied to build predictive models for different classification problems. In addition, tools like StandardScaler, LabelEncoder, and train_test_split helped prepare the data for modeling, while performance metrics such as confusion_matrix and classification_report facilitated model evaluation.

PySpark was imported and a Spark session was initialized with Colab-optimized settings to handle large-scale data efficiently, enabling distributed computation, feature engineering, and model building on big datasets. This setup ensured that both exploratory and predictive analyses could be conducted effectively on the global terrorism dataset.

In [344]:
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

Loading the Dataset¶

In [ ]:
# ---------------- Performance Monitoring ----------------
def monitor_loading_performance():
    process = psutil.Process()
    return {
        'memory_mb': process.memory_info().rss / 1024**2,
        'cpu_percent': psutil.cpu_percent(interval=1),
        'timestamp': datetime.now()
    }

# Dataset path in Colab
dataset_path = "/content/Dataset_BigData.csv"

# ---------------- Check File ----------------
if os.path.exists(dataset_path):
    file_size_mb = os.path.getsize(dataset_path) / (1024**2)
    print(f"\n Dataset found: {dataset_path} ({file_size_mb:.2f} MB)")
else:
    raise FileNotFoundError(f"Dataset not found at {dataset_path}!")

# ---------------- Pandas Loading ----------------
print("\n Loading with Pandas...")
start_time = time.time()
initial_memory = monitor_loading_performance()

# Load into raw_gtd_df instead of df_pandas
raw_gtd_df = pd.read_csv(dataset_path, encoding='ISO-8859-1')

pandas_load_time = time.time() - start_time
pandas_memory = monitor_loading_performance()
print(f" Pandas loaded in {pandas_load_time:.2f}s | Memory: {pandas_memory['memory_mb']:.1f} MB | Shape: {raw_gtd_df.shape}")

# ---------------- Spark Loading ----------------
print("\n Loading with Spark...")
spark_start = time.time()

df_spark = spark.read.csv(dataset_path, header=True, inferSchema=True)
df_spark.cache()
spark_count = df_spark.count()

spark_load_time = time.time() - spark_start
spark_memory = monitor_loading_performance()
print(f" Spark loaded in {spark_load_time:.2f}s | Memory: {spark_memory['memory_mb']:.1f} MB | Records: {spark_count:,} | Partitions: {df_spark.rdd.getNumPartitions()}")

# ---------------- Performance Comparison ----------------
print("\n LOADING PERFORMANCE ANALYSIS:")
print(f"   Pandas: {pandas_load_time:.2f}s")
print(f"   Spark:  {spark_load_time:.2f}s")
if spark_load_time < pandas_load_time:
    print(f"    Spark is {pandas_load_time / spark_load_time:.2f}x faster than Pandas")
else:
    print(f"    Pandas is {spark_load_time / pandas_load_time:.2f}x faster than Spark")
 Dataset found: /content/Dataset_BigData.csv (155.27 MB)

 Loading with Pandas...
 Pandas loaded in 5.74s | Memory: 1972.4 MB | Shape: (181691, 135)

 Loading with Spark...
 Spark loaded in 5.01s | Memory: 1972.4 MB | Records: 181,691 | Partitions: 2

 LOADING PERFORMANCE ANALYSIS:
   Pandas: 5.74s
   Spark:  5.01s
    Spark is 1.15x faster than Pandas

The dataset was loaded and analyzed using both Pandas and PySpark to compare performance and resource utilization. Using Pandas, the 155 MB dataset containing 181,691 records and 135 features was loaded in 5.74 seconds with a memory usage of approximately 1,130 MB. In contrast, PySpark was able to load the same dataset in 5.01 seconds, caching the data for efficient distributed processing across 2 partitions. The performance comparison indicates that PySpark is 1.15 times faster than Pandas for this dataset, demonstrating the advantages of using Spark for handling larger datasets and performing scalable data processing in a big data environment.

In [147]:
import chardet

# Read a small sample of the file to guess the encoding
with open(dataset_path, 'rb') as f:
    rawdata = f.read(10000)  # read first 10 KB
    result = chardet.detect(rawdata)

print(result)
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
In [224]:
# Load the raw data
raw_gtd_df = spark.read.csv(
    dataset_path,
    header=True,
    inferSchema=True,
    encoding='ISO-8859-1'
)

This block loads the Global Terrorism Database CSV file into a Spark DataFrame using the specified file path and appropriate character encoding to handle special characters.

In [225]:
# Number of rows
num_rows = raw_gtd_df.count()

# Number of columns
num_cols = len(raw_gtd_df.columns)

print(f"Shape: ({num_rows}, {num_cols})")
Shape: (181691, 135)

This line returns the shape of the dataset, indicating that it contains 181,691 rows (incidents) and 135 columns (features).

In [226]:
 #To check the name of features.
# List all feature names / columns
feature_names = raw_gtd_df.columns
print(feature_names)
['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended', 'resolution', 'country', 'country_txt', 'region', 'region_txt', 'provstate', 'city', 'latitude', 'longitude', 'specificity', 'vicinity', 'location', 'summary', 'crit1', 'crit2', 'crit3', 'doubtterr', 'alternative', 'alternative_txt', 'multiple', 'success', 'suicide', 'attacktype1', 'attacktype1_txt', 'attacktype2', 'attacktype2_txt', 'attacktype3', 'attacktype3_txt', 'targtype1', 'targtype1_txt', 'targsubtype1', 'targsubtype1_txt', 'corp1', 'target1', 'natlty1', 'natlty1_txt', 'targtype2', 'targtype2_txt', 'targsubtype2', 'targsubtype2_txt', 'corp2', 'target2', 'natlty2', 'natlty2_txt', 'targtype3', 'targtype3_txt', 'targsubtype3', 'targsubtype3_txt', 'corp3', 'target3', 'natlty3', 'natlty3_txt', 'gname', 'gsubname', 'gname2', 'gsubname2', 'gname3', 'gsubname3', 'motive', 'guncertain1', 'guncertain2', 'guncertain3', 'individual', 'nperps', 'nperpcap', 'claimed', 'claimmode', 'claimmode_txt', 'claim2', 'claimmode2', 'claimmode2_txt', 'claim3', 'claimmode3', 'claimmode3_txt', 'compclaim', 'weaptype1', 'weaptype1_txt', 'weapsubtype1', 'weapsubtype1_txt', 'weaptype2', 'weaptype2_txt', 'weapsubtype2', 'weapsubtype2_txt', 'weaptype3', 'weaptype3_txt', 'weapsubtype3', 'weapsubtype3_txt', 'weaptype4', 'weaptype4_txt', 'weapsubtype4', 'weapsubtype4_txt', 'weapdetail', 'nkill', 'nkillus', 'nkillter', 'nwound', 'nwoundus', 'nwoundte', 'property', 'propextent', 'propextent_txt', 'propvalue', 'propcomment', 'ishostkid', 'nhostkid', 'nhostkidus', 'nhours', 'ndays', 'divert', 'kidhijcountry', 'ransom', 'ransomamt', 'ransomamtus', 'ransompaid', 'ransompaidus', 'ransomnote', 'hostkidoutcome', 'hostkidoutcome_txt', 'nreleased', 'addnotes', 'scite1', 'scite2', 'scite3', 'dbsource', 'INT_LOG', 'INT_IDEO', 'INT_MISC', 'INT_ANY', 'related']

This command lists all the column names in the dataset, allowing the user to understand the available variables and plan the analysis accordingly.

In [227]:
# Show first 5 rows
raw_gtd_df.show(5)
+------------+-----+------+----+----------+--------+----------+-------+------------------+------+--------------------+---------+-------------+---------+----------+-----------+--------+--------+-------+-----+-----+-----+---------+-----------+---------------+--------+-------+-------+-----------+--------------------+-----------+---------------+-----------+---------------+---------+--------------------+------------+--------------------+--------------------+--------------------+-------+------------------+---------+-------------+------------+----------------+-----+-------+-------+-----------+---------+-------------+------------+----------------+-----+-------+-------+-----------+--------------------+--------+------+---------+------+---------+------+-----------+-----------+-----------+----------+------+--------+-------+---------+-------------+------+----------+--------------+------+----------+--------------+---------+---------+-------------+------------+--------------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+----------+-----+-------+--------+------+--------+--------+--------+----------+--------------+---------+-----------+---------+--------+----------+------+-----+------+-------------+------+---------+-----------+----------+------------+----------+--------------+------------------+---------+--------+------+------+------+--------+-------+--------+--------+-------+-------+
|     eventid|iyear|imonth|iday|approxdate|extended|resolution|country|       country_txt|region|          region_txt|provstate|         city| latitude| longitude|specificity|vicinity|location|summary|crit1|crit2|crit3|doubtterr|alternative|alternative_txt|multiple|success|suicide|attacktype1|     attacktype1_txt|attacktype2|attacktype2_txt|attacktype3|attacktype3_txt|targtype1|       targtype1_txt|targsubtype1|    targsubtype1_txt|               corp1|             target1|natlty1|       natlty1_txt|targtype2|targtype2_txt|targsubtype2|targsubtype2_txt|corp2|target2|natlty2|natlty2_txt|targtype3|targtype3_txt|targsubtype3|targsubtype3_txt|corp3|target3|natlty3|natlty3_txt|               gname|gsubname|gname2|gsubname2|gname3|gsubname3|motive|guncertain1|guncertain2|guncertain3|individual|nperps|nperpcap|claimed|claimmode|claimmode_txt|claim2|claimmode2|claimmode2_txt|claim3|claimmode3|claimmode3_txt|compclaim|weaptype1|weaptype1_txt|weapsubtype1|    weapsubtype1_txt|weaptype2|weaptype2_txt|weapsubtype2|weapsubtype2_txt|weaptype3|weaptype3_txt|weapsubtype3|weapsubtype3_txt|weaptype4|weaptype4_txt|weapsubtype4|weapsubtype4_txt|weapdetail|nkill|nkillus|nkillter|nwound|nwoundus|nwoundte|property|propextent|propextent_txt|propvalue|propcomment|ishostkid|nhostkid|nhostkidus|nhours|ndays|divert|kidhijcountry|ransom|ransomamt|ransomamtus|ransompaid|ransompaidus|ransomnote|hostkidoutcome|hostkidoutcome_txt|nreleased|addnotes|scite1|scite2|scite3|dbsource|INT_LOG|INT_IDEO|INT_MISC|INT_ANY|related|
+------------+-----+------+----+----------+--------+----------+-------+------------------+------+--------------------+---------+-------------+---------+----------+-----------+--------+--------+-------+-----+-----+-----+---------+-----------+---------------+--------+-------+-------+-----------+--------------------+-----------+---------------+-----------+---------------+---------+--------------------+------------+--------------------+--------------------+--------------------+-------+------------------+---------+-------------+------------+----------------+-----+-------+-------+-----------+---------+-------------+------------+----------------+-----+-------+-------+-----------+--------------------+--------+------+---------+------+---------+------+-----------+-----------+-----------+----------+------+--------+-------+---------+-------------+------+----------+--------------+------+----------+--------------+---------+---------+-------------+------------+--------------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+----------+-----+-------+--------+------+--------+--------+--------+----------+--------------+---------+-----------+---------+--------+----------+------+-----+------+-------------+------+---------+-----------+----------+------------+----------+--------------+------------------+---------+--------+------+------+------+--------+-------+--------+--------+-------+-------+
|197000000001| 1970|     7|   2|      NULL|       0|      NULL|     58|Dominican Republic|     2|Central America &...|     NULL|Santo Domingo|18.456792|-69.951164|          1|       0|    NULL|   NULL|    1|    1|    1|        0|       NULL|           NULL|       0|      1|      0|          1|       Assassination|       NULL|           NULL|       NULL|           NULL|       14|Private Citizens ...|          68|      Named Civilian|                NULL|        Julio Guzman|     58|Dominican Republic|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|              MANO-D|    NULL|  NULL|     NULL|  NULL|     NULL|  NULL|          0|       NULL|       NULL|         0|  NULL|    NULL|   NULL|     NULL|         NULL|  NULL|      NULL|          NULL|  NULL|      NULL|          NULL|     NULL|       13|      Unknown|        NULL|                NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|      NULL|    1|   NULL|    NULL|     0|    NULL|    NULL|       0|      NULL|          NULL|     NULL|       NULL|        0|    NULL|      NULL|  NULL| NULL|  NULL|         NULL|     0|     NULL|       NULL|      NULL|        NULL|      NULL|          NULL|              NULL|     NULL|    NULL|  NULL|  NULL|  NULL|    PGIS|      0|       0|       0|      0|   NULL|
|197000000002| 1970|     0|   0|      NULL|       0|      NULL|    130|            Mexico|     1|       North America|  Federal|  Mexico city|19.371887|-99.086624|          1|       0|    NULL|   NULL|    1|    1|    1|        0|       NULL|           NULL|       0|      1|      0|          6|Hostage Taking (K...|       NULL|           NULL|       NULL|           NULL|        7|Government (Diplo...|          45|Diplomatic Person...|Belgian Ambassado...|Nadine Chaval, da...|     21|           Belgium|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|23rd of September...|    NULL|  NULL|     NULL|  NULL|     NULL|  NULL|          0|       NULL|       NULL|         0|     7|    NULL|   NULL|     NULL|         NULL|  NULL|      NULL|          NULL|  NULL|      NULL|          NULL|     NULL|       13|      Unknown|        NULL|                NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|      NULL|    0|   NULL|    NULL|     0|    NULL|    NULL|       0|      NULL|          NULL|     NULL|       NULL|        1|       1|         0|  NULL| NULL|  NULL|       Mexico|     1|   800000|       NULL|      NULL|        NULL|      NULL|          NULL|              NULL|     NULL|    NULL|  NULL|  NULL|  NULL|    PGIS|      0|       1|       1|      1|   NULL|
|197001000001| 1970|     1|   0|      NULL|       0|      NULL|    160|       Philippines|     5|      Southeast Asia|   Tarlac|      Unknown|15.478598|120.599741|          4|       0|    NULL|   NULL|    1|    1|    1|        0|       NULL|           NULL|       0|      1|      0|          1|       Assassination|       NULL|           NULL|       NULL|           NULL|       10| Journalists & Media|          54|Radio Journalist/...|    Voice of America|            Employee|    217|     United States|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|             Unknown|    NULL|  NULL|     NULL|  NULL|     NULL|  NULL|          0|       NULL|       NULL|         0|  NULL|    NULL|   NULL|     NULL|         NULL|  NULL|      NULL|          NULL|  NULL|      NULL|          NULL|     NULL|       13|      Unknown|        NULL|                NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|      NULL|    1|   NULL|    NULL|     0|    NULL|    NULL|       0|      NULL|          NULL|     NULL|       NULL|        0|    NULL|      NULL|  NULL| NULL|  NULL|         NULL|     0|     NULL|       NULL|      NULL|        NULL|      NULL|          NULL|              NULL|     NULL|    NULL|  NULL|  NULL|  NULL|    PGIS|     -9|      -9|       1|      1|   NULL|
|197001000002| 1970|     1|   0|      NULL|       0|      NULL|     78|            Greece|     8|      Western Europe|   Attica|       Athens| 37.99749| 23.762728|          1|       0|    NULL|   NULL|    1|    1|    1|        0|       NULL|           NULL|       0|      1|      0|          3|   Bombing/Explosion|       NULL|           NULL|       NULL|           NULL|        7|Government (Diplo...|          46|   Embassy/Consulate|                NULL|        U.S. Embassy|    217|     United States|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|             Unknown|    NULL|  NULL|     NULL|  NULL|     NULL|  NULL|          0|       NULL|       NULL|         0|  NULL|    NULL|   NULL|     NULL|         NULL|  NULL|      NULL|          NULL|  NULL|      NULL|          NULL|     NULL|        6|   Explosives|          16|Unknown Explosive...|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL| Explosive| NULL|   NULL|    NULL|  NULL|    NULL|    NULL|       1|      NULL|          NULL|     NULL|       NULL|        0|    NULL|      NULL|  NULL| NULL|  NULL|         NULL|     0|     NULL|       NULL|      NULL|        NULL|      NULL|          NULL|              NULL|     NULL|    NULL|  NULL|  NULL|  NULL|    PGIS|     -9|      -9|       1|      1|   NULL|
|197001000003| 1970|     1|   0|      NULL|       0|      NULL|    101|             Japan|     4|           East Asia|  Fukouka|      Fukouka|33.580412|130.396361|          1|       0|    NULL|   NULL|    1|    1|    1|       -9|       NULL|           NULL|       0|      1|      0|          7|Facility/Infrastr...|       NULL|           NULL|       NULL|           NULL|        7|Government (Diplo...|          46|   Embassy/Consulate|                NULL|      U.S. Consulate|    217|     United States|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|     NULL|         NULL|        NULL|            NULL| NULL|   NULL|   NULL|       NULL|             Unknown|    NULL|  NULL|     NULL|  NULL|     NULL|  NULL|          0|       NULL|       NULL|         0|  NULL|    NULL|   NULL|     NULL|         NULL|  NULL|      NULL|          NULL|  NULL|      NULL|          NULL|     NULL|        8|   Incendiary|        NULL|                NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|     NULL|         NULL|        NULL|            NULL|Incendiary| NULL|   NULL|    NULL|  NULL|    NULL|    NULL|       1|      NULL|          NULL|     NULL|       NULL|        0|    NULL|      NULL|  NULL| NULL|  NULL|         NULL|     0|     NULL|       NULL|      NULL|        NULL|      NULL|          NULL|              NULL|     NULL|    NULL|  NULL|  NULL|  NULL|    PGIS|     -9|      -9|       1|      1|   NULL|
+------------+-----+------+----+----------+--------+----------+-------+------------------+------+--------------------+---------+-------------+---------+----------+-----------+--------+--------+-------+-----+-----+-----+---------+-----------+---------------+--------+-------+-------+-----------+--------------------+-----------+---------------+-----------+---------------+---------+--------------------+------------+--------------------+--------------------+--------------------+-------+------------------+---------+-------------+------------+----------------+-----+-------+-------+-----------+---------+-------------+------------+----------------+-----+-------+-------+-----------+--------------------+--------+------+---------+------+---------+------+-----------+-----------+-----------+----------+------+--------+-------+---------+-------------+------+----------+--------------+------+----------+--------------+---------+---------+-------------+------------+--------------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+---------+-------------+------------+----------------+----------+-----+-------+--------+------+--------+--------+--------+----------+--------------+---------+-----------+---------+--------+----------+------+-----+------+-------------+------+---------+-----------+----------+------------+----------+--------------+------------------+---------+--------+------+------+------+--------+-------+--------+--------+-------+-------+
only showing top 5 rows

In [228]:
# Get last 5 rows
tail_rows = raw_gtd_df.collect()[-5:]
for row in tail_rows:
    print(row)
Row(eventid=201712310022, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=182, country_txt='Somalia', region=11, region_txt='Sub-Saharan Africa', provstate='Middle Shebelle', city='Ceelka Geelow', latitude=2.359673, longitude=45.385034, specificity=2, vicinity=0, location='The incident occurred near the town of Balcad.', summary='12/31/2017: Assailants opened fire on a Somali National Army (SNA) checkpoint in Ceelka Geelow, Middle Shebelle, Somalia. At least one soldier was killed and two soldiers were injured in the ensuing clash. Al-Shabaab claimed responsibility for the attack.', crit1='1', crit2='1', crit3='0', doubtterr='1', alternative='1', alternative_txt='Insurgency/Guerilla Action', multiple='0', success='1', suicide='0', attacktype1='2', attacktype1_txt='Armed Assault', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='4', targtype1_txt='Military', targsubtype1='36', targsubtype1_txt='Military Checkpoint', corp1='Somali National Army (SNA)', target1='Checkpoint', natlty1='182', natlty1_txt='Somalia', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Al-Shabaab', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='1', claimmode='10', claimmode_txt='Unknown', claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='5', weaptype1_txt='Firearms', weapsubtype1='5', weapsubtype1_txt='Unknown Gun Type', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail=None, nkill='1', nkillus='0', nkillter='0', nwound='2', nwoundus='0', nwoundte='0', property='-9', propextent=None, propextent_txt=None, propvalue=None, propcomment=None, ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Somalia: Al-Shabaab Militants Attack Army Checkpoint in Middle Shabeelle Region', scite2='"" Hiiraan Online', scite3=' January 1', dbsource=' 2018."', INT_LOG='"""Highlights: Somalia Daily Media Highlights 2 January 2018', INT_IDEO='"" Summary', INT_MISC=' January 3', INT_ANY=' 2018."', related='"""Highlights: Somalia Daily Media Highlights 1 January 2018')
Row(eventid=201712310029, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=200, country_txt='Syria', region=10, region_txt='Middle East & North Africa', provstate='Lattakia', city='Jableh', latitude=35.407278, longitude=35.942679, specificity=1, vicinity=1, location='The incident occurred at the Humaymim Airport.', summary='12/31/2017: Assailants launched mortars at the Hmeymim Air Base in Jableh, Lattakia, Syria. Two Russian soldiers were killed and ten people were injured in the attack. No group claimed responsibility for the incident; however, sources attributed the attack to Muslim extremists.', crit1='1', crit2='1', crit3='0', doubtterr='1', alternative='1', alternative_txt='Insurgency/Guerilla Action', multiple='0', success='1', suicide='0', attacktype1='3', attacktype1_txt='Bombing/Explosion', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='4', targtype1_txt='Military', targsubtype1='27', targsubtype1_txt='Military Barracks/Base/Headquarters/Checkpost', corp1='Russian Air Force', target1='Hmeymim Air Base', natlty1='167', natlty1_txt='Russia', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Muslim extremists', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='6', weaptype1_txt='Explosives', weapsubtype1='11', weapsubtype1_txt='Projectile (rockets, mortars, RPGs, etc.)', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail='Mortars were used in the attack.', nkill='2', nkillus='0', nkillter='0', nwound='7', nwoundus='0', nwoundte='0', property='1', propextent='4', propextent_txt='Unknown', propvalue='-99', propcomment='Seven military planes were damaged in this attack.', ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Putin\'s \'victory\' in Syria has turned into a farce - Turchynov', scite2='"" MENA English (Middle East and North Africa Financial Network)', scite3=' January 5', dbsource=' 2018."', INT_LOG='"""Two Russian soldiers killed at Hmeymim base in Syria', INT_IDEO='"" Ansamed', INT_MISC=' January 4', INT_ANY=' 2018."', related='"""Two Russian servicemen killed in Syria mortar attack')
Row(eventid=201712310030, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=160, country_txt='Philippines', region=5, region_txt='Southeast Asia', provstate='Maguindanao', city='Kubentog', latitude=6.900742, longitude=124.437908, specificity=2, vicinity=0, location='The incident occurred in the Datu Hoffer district.', summary='12/31/2017: Assailants set fire to houses in Kubentog, Datu Hoffer, Maguindanao, Philippines. There were no reported casualties in the attack. No group claimed responsibility for the incident; however, sources attributed the attack to the Bangsamoro Islamic Freedom Movement (BIFM).', crit1='1', crit2='1', crit3='1', doubtterr='0', alternative=None, alternative_txt=None, multiple='0', success='1', suicide='0', attacktype1='7', attacktype1_txt='Facility/Infrastructure Attack', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='14', targtype1_txt='Private Citizens & Property', targsubtype1='76', targsubtype1_txt='House/Apartment/Residence', corp1='Not Applicable', target1='Houses', natlty1='160', natlty1_txt='Philippines', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Bangsamoro Islamic Freedom Movement (BIFM)', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='8', weaptype1_txt='Incendiary', weapsubtype1='18', weapsubtype1_txt='Arson/Fire', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail=None, nkill='0', nkillus='0', nkillter='0', nwound='0', nwoundus='0', nwoundte='0', property='1', propextent='4', propextent_txt='Unknown', propvalue='-99', propcomment='Houses were damaged in this attack.', ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Maguindanao clashes trap tribe members', scite2='"" Philippines Daily Inquirer', scite3=' January 3', dbsource=' 2018."', INT_LOG=None, INT_IDEO=None, INT_MISC='START Primary Collection', INT_ANY='0', related='0')
Row(eventid=201712310031, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=92, country_txt='India', region=6, region_txt='South Asia', provstate='Manipur', city='Imphal', latitude=24.798346, longitude=93.94043, specificity=1, vicinity=0, location='The incident occurred in the Mantripukhri neighborhood.', summary='12/31/2017: Assailants threw a grenade at a Forest Department office in Mantripukhri neighborhood, Imphal, Manipur, India. No casualties were reported in the blast. No group claimed responsibility for the incident.', crit1='1', crit2='1', crit3='1', doubtterr='0', alternative=None, alternative_txt=None, multiple='0', success='0', suicide='0', attacktype1='3', attacktype1_txt='Bombing/Explosion', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='2', targtype1_txt='Government (General)', targsubtype1='21', targsubtype1_txt='Government Building/Facility/Office', corp1='Forest Department Manipur', target1='Office', natlty1='92', natlty1_txt='India', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Unknown', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='6', weaptype1_txt='Explosives', weapsubtype1='7', weapsubtype1_txt='Grenade', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail='A thrown grenade was used in the attack.', nkill='0', nkillus='0', nkillter='0', nwound='0', nwoundus='0', nwoundte='0', property='-9', propextent=None, propextent_txt=None, propvalue=None, propcomment=None, ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Trader escapes grenade attack in Imphal', scite2='"" Business Standard India', scite3=' January 3', dbsource=' 2018."', INT_LOG=None, INT_IDEO=None, INT_MISC='START Primary Collection', INT_ANY='-9', related='-9')
Row(eventid=201712310032, iyear=2017, imonth=12, iday=31, approxdate=None, extended=0, resolution=None, country=160, country_txt='Philippines', region=5, region_txt='Southeast Asia', provstate='Maguindanao', city='Cotabato City', latitude=7.209594, longitude=124.241966, specificity=1, vicinity=0, location=None, summary='12/31/2017: An explosive device was discovered and defused at a plaza in Cotabato City, Maguindanao, Philippines. No group claimed responsibility for the incident.', crit1='1', crit2='1', crit3='1', doubtterr='0', alternative=None, alternative_txt=None, multiple='0', success='0', suicide='0', attacktype1='3', attacktype1_txt='Bombing/Explosion', attacktype2=None, attacktype2_txt=None, attacktype3=None, attacktype3_txt=None, targtype1='20', targtype1_txt='Unknown', targsubtype1=None, targsubtype1_txt=None, corp1='Unknown', target1='Unknown', natlty1='160', natlty1_txt='Philippines', targtype2=None, targtype2_txt=None, targsubtype2=None, targsubtype2_txt=None, corp2=None, target2=None, natlty2=None, natlty2_txt=None, targtype3=None, targtype3_txt=None, targsubtype3=None, targsubtype3_txt=None, corp3=None, target3=None, natlty3=None, natlty3_txt=None, gname='Unknown', gsubname=None, gname2=None, gsubname2=None, gname3=None, gsubname3=None, motive=None, guncertain1='0', guncertain2=None, guncertain3=None, individual='0', nperps='-99', nperpcap='0', claimed='0', claimmode=None, claimmode_txt=None, claim2=None, claimmode2=None, claimmode2_txt=None, claim3=None, claimmode3=None, claimmode3_txt=None, compclaim=None, weaptype1='6', weaptype1_txt='Explosives', weapsubtype1='16', weapsubtype1_txt='Unknown Explosive Type', weaptype2=None, weaptype2_txt=None, weapsubtype2=None, weapsubtype2_txt=None, weaptype3=None, weaptype3_txt=None, weapsubtype3=None, weapsubtype3_txt=None, weaptype4=None, weaptype4_txt=None, weapsubtype4=None, weapsubtype4_txt=None, weapdetail='An explosive device containing a detonating cord, a battery, and a blasting cap was used in the attack.', nkill='0', nkillus='0', nkillter='0', nwound='0', nwoundus='0', nwoundte='0', property='0', propextent=None, propextent_txt=None, propvalue=None, propcomment=None, ishostkid='0', nhostkid=None, nhostkidus=None, nhours=None, ndays=None, divert=None, kidhijcountry=None, ransom=None, ransomamt=None, ransomamtus=None, ransompaid=None, ransompaidus=None, ransomnote=None, hostkidoutcome=None, hostkidoutcome_txt=None, nreleased=None, addnotes=None, scite1='"""Security tightened in Cotabato following IED discovery', scite2='"" Tempo', scite3=' January 4', dbsource=' 2018."', INT_LOG='"""Security tightened in Cotabato City', INT_IDEO='"" Manila Bulletin', INT_MISC=' January 3', INT_ANY=' 2018."', related=None)
In [229]:
from pyspark.sql import functions as F

# Select required columns
gtd_df = raw_gtd_df.select(
    'iyear', 'imonth', 'iday', 'region_txt', 'country_txt', 'provstate',
    'latitude', 'longitude', 'success', 'attacktype1_txt', 'targtype1_txt',
    'target1', 'weaptype1_txt', 'gname', 'suicide', 'nkill', 'nwound',
    'nkillter', 'summary', 'motive', 'propextent', 'dbsource'
)
In [230]:
# Rename columns for readability
gtd_df = gtd_df.withColumnRenamed('iyear', 'year') \
               .withColumnRenamed('imonth', 'month') \
               .withColumnRenamed('iday', 'day') \
               .withColumnRenamed('region_txt', 'region') \
               .withColumnRenamed('country_txt', 'country') \
               .withColumnRenamed('provstate', 'province') \
               .withColumnRenamed('attacktype1_txt', 'attack_type') \
               .withColumnRenamed('targtype1_txt', 'target_type') \
               .withColumnRenamed('target1', 'target') \
               .withColumnRenamed('weaptype1_txt', 'weapon_type') \
               .withColumnRenamed('gname', 'terror_group') \
               .withColumnRenamed('nkill', 'killed') \
               .withColumnRenamed('nwound', 'wounded') \
               .withColumnRenamed('nkillter', 'perpetrator_kill') \
               .withColumnRenamed('propextent', 'propextent')
In [231]:
rows = gtd_df.count()
cols = len(gtd_df.columns)
print(f"Shape: ({rows}, {cols})")
Shape: (181691, 22)

I then have effectively streamlined the dataset by dropping irrelevant or redundant columns and retaining only the most essential features related to the analysis of terrorist incidents. Specifically, I selected 22 key columns from the original 135, focusing on attributes such as the date and location of the attack (year, month, day, region, country, province, latitude, longitude), attack characteristics (success, attack_type, weapon_type, suicide), target and perpetrator details (target_type, target, terror_group, perpetrator_kill), casualty counts (killed, wounded), and additional contextual information (summary, motive, propextent, dbsource). I also renamed these columns to more readable and intuitive names, which improves the clarity of future analysis. As a result, the dataset was reduced to 181,691 rows and 22 columns, containing the most relevant information needed for in-depth terrorism data analysis.

In [232]:
 #To check the name of features.
# Check the new column names
print(gtd_df.columns)
['year', 'month', 'day', 'region', 'country', 'province', 'latitude', 'longitude', 'success', 'attack_type', 'target_type', 'target', 'weapon_type', 'terror_group', 'suicide', 'killed', 'wounded', 'perpetrator_kill', 'summary', 'motive', 'propextent', 'dbsource']
In [233]:
# Show first 5 rows
gtd_df.show(5)
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+
|year|month|day|              region|           country|province| latitude| longitude|success|         attack_type|         target_type|              target|weapon_type|        terror_group|suicide|killed|wounded|perpetrator_kill|summary|motive|propextent|dbsource|
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+
|1970|    7|  2|Central America &...|Dominican Republic|    NULL|18.456792|-69.951164|      1|       Assassination|Private Citizens ...|        Julio Guzman|    Unknown|              MANO-D|      0|     1|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|
|1970|    0|  0|       North America|            Mexico| Federal|19.371887|-99.086624|      1|Hostage Taking (K...|Government (Diplo...|Nadine Chaval, da...|    Unknown|23rd of September...|      0|     0|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|
|1970|    1|  0|      Southeast Asia|       Philippines|  Tarlac|15.478598|120.599741|      1|       Assassination| Journalists & Media|            Employee|    Unknown|             Unknown|      0|     1|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|
|1970|    1|  0|      Western Europe|            Greece|  Attica| 37.99749| 23.762728|      1|   Bombing/Explosion|Government (Diplo...|        U.S. Embassy| Explosives|             Unknown|      0|  NULL|   NULL|            NULL|   NULL|  NULL|      NULL|    PGIS|
|1970|    1|  0|           East Asia|             Japan| Fukouka|33.580412|130.396361|      1|Facility/Infrastr...|Government (Diplo...|      U.S. Consulate| Incendiary|             Unknown|      0|  NULL|   NULL|            NULL|   NULL|  NULL|      NULL|    PGIS|
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+
only showing top 5 rows

Feature Engineering¶

In [234]:
# Parse Dates
gtd_df = gtd_df.withColumn(
    'date',
    F.to_date(F.concat_ws('-', F.col('year'), F.col('month'), F.col('day')), 'yyyy-M-d')
)
In [235]:
# Add casualties column
gtd_df = gtd_df.withColumn(
    'casualties',
    F.coalesce(F.col('killed'), F.lit(0)) + F.coalesce(F.col('wounded'), F.lit(0))
)
In [236]:
rows = gtd_df.count()
cols = len(gtd_df.columns)
print(f"Shape: ({rows}, {cols})")
Shape: (181691, 24)
In [237]:
# Show first 5 rows
gtd_df.show(5)
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+
|year|month|day|              region|           country|province| latitude| longitude|success|         attack_type|         target_type|              target|weapon_type|        terror_group|suicide|killed|wounded|perpetrator_kill|summary|motive|propextent|dbsource|      date|casualties|
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+
|1970|    7|  2|Central America &...|Dominican Republic|    NULL|18.456792|-69.951164|      1|       Assassination|Private Citizens ...|        Julio Guzman|    Unknown|              MANO-D|      0|     1|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|1970-07-02|       1.0|
|1970|    0|  0|       North America|            Mexico| Federal|19.371887|-99.086624|      1|Hostage Taking (K...|Government (Diplo...|Nadine Chaval, da...|    Unknown|23rd of September...|      0|     0|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|      NULL|       0.0|
|1970|    1|  0|      Southeast Asia|       Philippines|  Tarlac|15.478598|120.599741|      1|       Assassination| Journalists & Media|            Employee|    Unknown|             Unknown|      0|     1|      0|            NULL|   NULL|  NULL|      NULL|    PGIS|      NULL|       1.0|
|1970|    1|  0|      Western Europe|            Greece|  Attica| 37.99749| 23.762728|      1|   Bombing/Explosion|Government (Diplo...|        U.S. Embassy| Explosives|             Unknown|      0|  NULL|   NULL|            NULL|   NULL|  NULL|      NULL|    PGIS|      NULL|       0.0|
|1970|    1|  0|           East Asia|             Japan| Fukouka|33.580412|130.396361|      1|Facility/Infrastr...|Government (Diplo...|      U.S. Consulate| Incendiary|             Unknown|      0|  NULL|   NULL|            NULL|   NULL|  NULL|      NULL|    PGIS|      NULL|       0.0|
+----+-----+---+--------------------+------------------+--------+---------+----------+-------+--------------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+----------------+-------+------+----------+--------+----------+----------+
only showing top 5 rows

For feature engineering, I added a new column called casualties, which combines the number of people killed and wounded in each incident to provide a more comprehensive measure of the total human impact of an attack. This variable is crucial for later severity analysis, clustering, and risk scoring, as it captures the full extent of harm caused. Additionally, I parsed the date from the separate year, month, and day columns into a single date column using pd.to_datetime, enabling easier time-series analysis, trend forecasting, and chronological filtering. These feature engineering steps are important because they create new, analytically valuable variables and transform the dataset into a format that is more suitable for visualization, modeling, and insightful interpretation. This increased the no. of columns from 22 to 24.

In [238]:
# Method 1: Print schema nicely
gtd_df.printSchema()

# Method 2: Get a list of column names and their types
gtd_df.dtypes
root
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- region: string (nullable = true)
 |-- country: string (nullable = true)
 |-- province: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- success: string (nullable = true)
 |-- attack_type: string (nullable = true)
 |-- target_type: string (nullable = true)
 |-- target: string (nullable = true)
 |-- weapon_type: string (nullable = true)
 |-- terror_group: string (nullable = true)
 |-- suicide: string (nullable = true)
 |-- killed: string (nullable = true)
 |-- wounded: string (nullable = true)
 |-- perpetrator_kill: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- motive: string (nullable = true)
 |-- propextent: string (nullable = true)
 |-- dbsource: string (nullable = true)
 |-- date: date (nullable = true)
 |-- casualties: double (nullable = true)

Out[238]:
[('year', 'int'),
 ('month', 'int'),
 ('day', 'int'),
 ('region', 'string'),
 ('country', 'string'),
 ('province', 'string'),
 ('latitude', 'double'),
 ('longitude', 'double'),
 ('success', 'string'),
 ('attack_type', 'string'),
 ('target_type', 'string'),
 ('target', 'string'),
 ('weapon_type', 'string'),
 ('terror_group', 'string'),
 ('suicide', 'string'),
 ('killed', 'string'),
 ('wounded', 'string'),
 ('perpetrator_kill', 'string'),
 ('summary', 'string'),
 ('motive', 'string'),
 ('propextent', 'string'),
 ('dbsource', 'string'),
 ('date', 'date'),
 ('casualties', 'double')]

Initial EDA¶

In [163]:
from pyspark.sql import functions as F

numerical_vars = ['killed', 'wounded', 'casualties', 'suicide', 'year', 'latitude', 'longitude']

# 1. Count, mean, stddev
summary_stats = gtd_df.select(numerical_vars).describe().filter(F.col("summary").isin("count", "mean", "stddev"))
summary_stats.show(truncate=False)

# 2. Median
median_df = gtd_df.select([F.expr(f'percentile_approx({c}, 0.5)').alias(c) for c in numerical_vars])
median_df.show(truncate=False)

# 3. Mode
mode_dict = {}
for c in numerical_vars:
    mode_val = gtd_df.groupBy(c).count().orderBy(F.desc('count')).first()[0]
    mode_dict[c] = mode_val

print("Mode values:")
for k, v in mode_dict.items():
    print(f"{k}: {v}")
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+
|summary|killed            |wounded           |casualties       |suicide            |year              |latitude          |longitude         |
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+
|count  |170794            |165291            |181476           |181606             |181691            |177135            |177134            |
|mean   |2.4038225968104   |3.1526405286503563|5.130154951618947|0.03772992620332636|2002.6389969783863|23.49834295928318 |-458.6956530247027|
|stddev |11.554775970212424|35.9396365074308  |40.55104567528549|0.19091822119542906|13.259430466246835|18.569242421025763|204778.9886113944 |
+-------+------------------+------------------+-----------------+-------------------+------------------+------------------+------------------+

+------+-------+----------+-------+----+---------+---------+
|killed|wounded|casualties|suicide|year|latitude |longitude|
+------+-------+----------+-------+----+---------+---------+
|0.0   |0.0    |1.0       |0.0    |2009|31.467463|43.243996|
+------+-------+----------+-------+----+---------+---------+

Mode values:
killed: 0
wounded: 0
casualties: 0.0
suicide: 0
year: 2014
latitude: 33.303566
longitude: 44.371773

An initial exploratory analysis was conducted on the numerical variables of the Global Terrorism Database, including killed, wounded, casualties, suicide, year, latitude, and longitude. The dataset contains varying numbers of observations per variable, with counts ranging from 165,291 for wounded to 181,606 for suicide. On average, incidents resulted in approximately 2.4 deaths, 3.15 wounded, and 5.13 total casualties, with a very small proportion (0.038) involving suicide attacks. The standard deviations indicate high variability in casualties and geographic coordinates, reflecting the sporadic and widespread nature of terrorist incidents. Median values show that more than half of the incidents involved zero deaths, zero wounded, and zero suicide attacks, highlighting that many events were minor in scale. Mode analysis confirms that the most frequently reported values were zero for casualties, deaths, and wounded, with the most common year being 2014, and the most common locations concentrated around latitude 33.3 and longitude 44.37. Overall, these statistics provide a foundational understanding of the distribution, central tendency, and variability in the dataset, guiding further analysis.

Data Preprocessing¶

Preprocessing for Numerical Variables¶

Now for data preprocessing, I have first done the preprocessing for numerical variables.

Handling Missing Values¶

To ensure the completeness and reliability of the dataset, missing values were systematically identified and addressed. A custom function was created to quantify both the absolute and relative extent of missing values across columns. Based on this analysis, columns such as motive, propextent, and perpetrator_kill were removed due to having a high proportion of missing data, rendering them unsuitable for analysis. The summary column, primarily textual in nature, had numerous missing entries, which were replaced with the placeholder "Unknown" to retain the structure without introducing bias. For numerical variables like killed and wounded, where missing values were relatively sparse, median imputation was employed to maintain the distribution of the data and prevent skewness that could arise from extreme values.

In [239]:
from pyspark.sql.types import IntegerType

# Cast numeric columns that are stored as strings
numeric_cols = ['success', 'suicide', 'killed', 'wounded', 'casualties']
for col in numeric_cols:
    gtd_df = gtd_df.withColumn(col, gtd_df[col].cast(IntegerType()))
In [240]:
from pyspark.sql.functions import col, sum, count, round

# total number of rows
total_rows = gtd_df.count()

# missing values count & percentage for each column
missing_df = gtd_df.select([
    sum(col(c).isNull().cast("int")).alias(c + "_missing") for c in gtd_df.columns
])

# reshape into column, missing_count, missing_percent
missing_long = (
    missing_df.selectExpr("stack(" + str(len(gtd_df.columns)) + "," +
                          ",".join([f"'{c}', {c}_missing" for c in gtd_df.columns]) +
                          ") as (column_name, missing_count)")
    .withColumn("missing_percent", round((col("missing_count")/total_rows)*100, 2))
)

missing_long.orderBy(col("missing_percent").desc()).show(truncate=False)
+----------------+-------------+---------------+
|column_name     |missing_count|missing_percent|
+----------------+-------------+---------------+
|motive          |131567       |72.41          |
|propextent      |117295       |64.56          |
|perpetrator_kill|67146        |36.96          |
|summary         |66129        |36.4           |
|wounded         |16440        |9.05           |
|killed          |11074        |6.09           |
|latitude        |4556         |2.51           |
|longitude       |4557         |2.51           |
|date            |891          |0.49           |
|dbsource        |877          |0.48           |
|target          |682          |0.38           |
|terror_group    |487          |0.27           |
|weapon_type     |436          |0.24           |
|province        |421          |0.23           |
|target_type     |263          |0.14           |
|casualties      |215          |0.12           |
|success         |207          |0.11           |
|suicide         |111          |0.06           |
|attack_type     |35           |0.02           |
|year            |0            |0.0            |
+----------------+-------------+---------------+
only showing top 20 rows

In [241]:
# -----------------------------
# 1. Remove columns with high missing proportion
# -----------------------------
# In your pandas code: ['motive','propextent','perpetrator_kill']
gtd_df = gtd_df.drop('motive', 'propextent', 'perpetrator_kill')

# -----------------------------
# 2. Fill 'summary' column nulls with 'Unknown'
# -----------------------------
gtd_df = gtd_df.withColumn('summary', F.coalesce(F.col('summary'), F.lit('Unknown')))

# -----------------------------
# 3. Fill 'killed' and 'wounded' nulls with median
# -----------------------------
# Calculate medians
median_killed = gtd_df.approxQuantile("killed", [0.5], 0.0)[0]
median_wounded = gtd_df.approxQuantile("wounded", [0.5], 0.0)[0]

gtd_df = gtd_df.withColumn("killed", F.coalesce(F.col("killed"), F.lit(median_killed)))
gtd_df = gtd_df.withColumn("wounded", F.coalesce(F.col("wounded"), F.lit(median_wounded)))
In [242]:
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape: ({num_rows}, {num_cols})")
Shape: (181691, 21)
In [243]:
# total number of rows
total_rows = gtd_df.count()

# missing values count & percentage for each column
missing_df = gtd_df.select([
    sum(col(c).isNull().cast("int")).alias(c + "_missing") for c in gtd_df.columns
])

# reshape into column, missing_count, missing_percent
missing_long = (
    missing_df.selectExpr("stack(" + str(len(gtd_df.columns)) + "," +
                          ",".join([f"'{c}', {c}_missing" for c in gtd_df.columns]) +
                          ") as (column_name, missing_count)")
    .withColumn("missing_percent", round((col("missing_count")/total_rows)*100, 2))
)

missing_long.orderBy(col("missing_percent").desc()).show(truncate=False)
+------------+-------------+---------------+
|column_name |missing_count|missing_percent|
+------------+-------------+---------------+
|latitude    |4556         |2.51           |
|longitude   |4557         |2.51           |
|date        |891          |0.49           |
|dbsource    |877          |0.48           |
|target      |682          |0.38           |
|terror_group|487          |0.27           |
|weapon_type |436          |0.24           |
|province    |421          |0.23           |
|target_type |263          |0.14           |
|casualties  |215          |0.12           |
|success     |207          |0.11           |
|suicide     |111          |0.06           |
|attack_type |35           |0.02           |
|year        |0            |0.0            |
|month       |0            |0.0            |
|day         |0            |0.0            |
|region      |0            |0.0            |
|country     |0            |0.0            |
|killed      |0            |0.0            |
|wounded     |0            |0.0            |
+------------+-------------+---------------+
only showing top 20 rows

In [244]:
import matplotlib.pyplot as plt
import pandas as pd
from pyspark.sql.functions import col, sum

# Total rows
total_rows = gtd_df.count()

# Compute missing values per column
missing_df = gtd_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in gtd_df.columns])

# Convert to Pandas for visualization
missing_pd = missing_df.toPandas().T.reset_index()
missing_pd.columns = ['column', 'missing_count']
missing_pd['missing_percent'] = (missing_pd['missing_count'] / total_rows) * 100

# Plot using Matplotlib
plt.figure(figsize=(10,6))
plt.bar(missing_pd['column'], missing_pd['missing_percent'], color=(0.75,0.75,0.475))
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.ylabel('% of Missing Values', fontsize=12)
plt.title('Missing Values per Column', fontsize=14)
plt.show()
In [ ]:
# visualising where exactly the missing values are
msno.matrix(gtd_df,figsize = (10,6),fontsize = 12,color = (0.75,.50,0.25))
Out[ ]:
<Axes: >

Handling Duplicates Values¶

Duplicate records in the dataset can lead to biased analysis, especially in frequency-based or aggregate computations. A check for duplicated rows revealed 3,170 duplicates within the dataset. These redundant entries were promptly removed using the drop_duplicates() function. Following this operation, the number of records was reduced from 181,691 to 172,141, ensuring each incident was uniquely represented and analytical results were not distorted by repeated cases.

In [245]:
# -----------------------------
# 1. Count duplicate rows
# -----------------------------
# Create a hash column of all columns to identify duplicates
from pyspark.sql.functions import concat_ws

gtd_df_with_hash = gtd_df.withColumn("row_hash", concat_ws("_", *gtd_df.columns))
duplicate_count = gtd_df_with_hash.groupBy("row_hash").count().filter(F.col("count") > 1).count()
print("Number of duplicate rows:", duplicate_count)
Number of duplicate rows: 3170
In [246]:
# -----------------------------
# 2. Display duplicate rows (optional)
# -----------------------------
duplicate_rows_df = gtd_df_with_hash.groupBy(gtd_df.columns)\
                                    .count()\
                                    .filter(F.col("count") > 1)\
                                    .drop("count")
duplicate_rows_df.show(truncate=False)
+----+-----+---+---------------------------+--------------+---------------------+----------+----------+-------+------------------------------+---------------------------+-------------------------+-----------+------------------------------------------------+-------+------+-------+-------+--------+----------+----------+
|year|month|day|region                     |country       |province             |latitude  |longitude |success|attack_type                   |target_type                |target                   |weapon_type|terror_group                                    |suicide|killed|wounded|summary|dbsource|date      |casualties|
+----+-----+---+---------------------------+--------------+---------------------+----------+----------+-------+------------------------------+---------------------------+-------------------------+-----------+------------------------------------------------+-------+------+-------+-------+--------+----------+----------+
|1979|1    |6  |Western Europe             |Italy         |Lazio                |41.890961 |12.490069 |1      |Facility/Infrastructure Attack|Business                   |Movie theater            |Incendiary |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1979-01-06|0         |
|1986|2    |21 |Central America & Caribbean|El Salvador   |Cuscatlan            |13.682638 |-88.926466|1      |Bombing/Explosion             |Utilities                  |electrical line post     |Explosives |Farabundo Marti National Liberation Front (FMLN)|0      |0.0   |0.0    |Unknown|PGIS    |1986-02-21|0         |
|1987|4    |9  |South America              |Peru          |Lima                 |-11.967368|-76.978462|0      |Bombing/Explosion             |Private Citizens & Property|Street                   |Explosives |Shining Path (SL)                               |0      |0.0   |0.0    |Unknown|PGIS    |1987-04-09|0         |
|1989|5    |17 |Central America & Caribbean|El Salvador   |Cabanas              |13.864829 |-88.7494  |1      |Bombing/Explosion             |Utilities                  |115,000 Volt Power Line  |Explosives |Farabundo Marti National Liberation Front (FMLN)|0      |0.0   |0.0    |Unknown|PGIS    |1989-05-17|0         |
|1991|4    |5  |South America              |Peru          |Lima                 |-11.975814|-76.7699  |1      |Bombing/Explosion             |Utilities                  |High Tension Power Lines |Explosives |Shining Path (SL)                               |0      |0.0   |0.0    |Unknown|PGIS    |1991-04-05|0         |
|1991|5    |27 |Central America & Caribbean|El Salvador   |Usulutan             |13.516667 |-88.383333|1      |Bombing/Explosion             |Utilities                  |High tension line tower* |Explosives |Farabundo Marti National Liberation Front (FMLN)|0      |0.0   |0.0    |Unknown|PGIS    |1991-05-27|0         |
|1992|7    |14 |Middle East & North Africa |Lebanon       |North                |34.438094 |35.830837 |1      |Bombing/Explosion             |Private Citizens & Property|Beach                    |Explosives |Unknown                                         |0      |0.0   |1.0    |Unknown|PGIS    |1992-07-14|1         |
|1992|10   |30 |Middle East & North Africa |Turkey        |Istanbul             |41.106178 |28.689863 |1      |Bombing/Explosion             |Government (General)       |Election Bureau          |Explosives |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1992-10-30|0         |
|1994|5    |9  |Middle East & North Africa |Turkey        |Adana                |36.99154  |35.331051 |1      |Bombing/Explosion             |Business                   |Automatic Teller Machine |Explosives |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1994-05-09|0         |
|1997|11   |29 |Western Europe             |Spain         |Basque Country       |43.07563  |-2.223667 |1      |Unknown                       |Business                   |Bank                     |Unknown    |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1997-11-29|0         |
|1983|10   |28 |South America              |Chile         |Unknown              |NULL      |NULL      |1      |Bombing/Explosion             |Transportation             |Chilean railway line     |Explosives |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1983-10-28|0         |
|1986|9    |1  |South America              |Chile         |Santiago Metropolitan|-33.366238|-70.505302|1      |Facility/Infrastructure Attack|Transportation             |Bus                      |Incendiary |Manuel Rodriguez Patriotic Front (FPMR)         |0      |0.0   |0.0    |Unknown|PGIS    |1986-09-01|0         |
|1981|5    |8  |Western Europe             |Greece        |Attica               |37.99749  |23.762728 |1      |Bombing/Explosion             |Police                     |Police Station           |Explosives |Revolutionary People's Struggle (ELA)           |0      |0.0   |0.0    |Unknown|PGIS    |1981-05-08|0         |
|1983|11   |10 |South America              |Peru          |Lima                 |-11.967368|-76.978462|1      |Bombing/Explosion             |Utilities                  |High tension tower       |Explosives |Shining Path (SL)                               |0      |0.0   |0.0    |Unknown|PGIS    |1983-11-10|0         |
|1984|1    |5  |Middle East & North Africa |Lebanon       |South                |33.550434 |35.370964 |1      |Bombing/Explosion             |Military                   |Israeli Military Unit    |Explosives |Shia Muslim extremists                          |0      |0.0   |0.0    |Unknown|PGIS    |1984-01-05|0         |
|1986|1    |3  |South America              |Peru          |Lima                 |-12.707508|-75.969184|1      |Bombing/Explosion             |Utilities                  |High tension line tower  |Explosives |Shining Path (SL)                               |0      |0.0   |0.0    |Unknown|PGIS    |1986-01-03|0         |
|1986|7    |1  |South America              |Chile         |Santiago Metropolitan|-33.366238|-70.505302|1      |Bombing/Explosion             |Business                   |a yarrow processing plant|Unknown    |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1986-07-01|0         |
|1987|6    |13 |Western Europe             |Malta         |South Eastern        |35.89779  |14.514106 |1      |Facility/Infrastructure Attack|Business                   |Hotel                    |Incendiary |Unknown                                         |0      |0.0   |0.0    |Unknown|PGIS    |1987-06-13|0         |
|1988|8    |27 |Western Europe             |United Kingdom|Northern Ireland     |55.011562 |-7.312045 |1      |Bombing/Explosion             |Private Citizens & Property|Street                   |Explosives |Irish Republican Army (IRA)                     |0      |0.0   |1.0    |Unknown|PGIS    |1988-08-27|1         |
|1990|8    |28 |Western Europe             |Spain         |Basque Country       |43.291618 |-1.977903 |1      |Bombing/Explosion             |Business                   |bar                      |Explosives |Basque Fatherland and Freedom (ETA)             |0      |0.0   |0.0    |Unknown|PGIS    |1990-08-28|0         |
+----+-----+---+---------------------------+--------------+---------------------+----------+----------+-------+------------------------------+---------------------------+-------------------------+-----------+------------------------------------------------+-------+------+-------+-------+--------+----------+----------+
only showing top 20 rows

In [247]:
# -----------------------------
# 3. Remove duplicate rows
# -----------------------------
gtd_df = gtd_df.dropDuplicates()
In [248]:
# -----------------------------
# 4. Check new shape
# -----------------------------
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape after removing duplicates: ({num_rows}, {num_cols})")
Shape after removing duplicates: (172141, 21)

Handling Outliers¶

Outliers in numerical features like killed, wounded, and casualties can disproportionately influence statistical analyses and model performance. To mitigate their impact, the Interquartile Range (IQR) method was applied. The first (Q1) and third quartiles (Q3) were computed for each variable to calculate the IQR, and upper thresholds were set at Q3 + 1.5*IQR. Rather than removing the identified outliers, a capping approach was used, replacing extreme values beyond the upper bound with the threshold value. This strategy preserved the data structure and sample size while limiting the influence of anomalously high values, particularly in terrorism-related incidents where some cases may report exceptionally large numbers of casualties.

In [249]:
# Summary statistics for selected numeric columns
gtd_df.select("killed", "wounded", "casualties").describe().show()
+-------+------------------+-----------------+------------------+
|summary|            killed|          wounded|        casualties|
+-------+------------------+-----------------+------------------+
|  count|            172141|           172141|            171927|
|   mean|2.3614827379880445|3.008870635118885|5.3763981224589505|
| stddev|11.496217161927367|35.21393062470437|41.637251045678035|
|    min|              -9.0|              0.0|                -4|
|    max|            1570.0|           8191.0|              9574|
+-------+------------------+-----------------+------------------+

In [10]:
# Select only killed, wounded, casualties
gtd_df_pd = gtd_df.select("killed", "wounded", "casualties").toPandas()

# Melt for Plotly
melted_df = gtd_df_pd.melt(var_name="Metric", value_name="Count")

# Box plot
fig = px.box(
    melted_df,
    x="Metric",
    y="Count",
    title="Distribution of Killed, Wounded, and Casualties",
    color="Metric",
    points="outliers"  # show outliers
)
fig.show()
In [251]:
def iqr_threshold(df, col):
    Q1, Q3 = df.approxQuantile(col, [0.25, 0.75], 0.0)
    IQR = Q3 - Q1
    upper_thresh = Q3 + 1.5 * IQR
    lower_thresh = Q1 - 1.5 * IQR
    return lower_thresh, upper_thresh

thresholds = {col: iqr_threshold(gtd_df, col) for col in numeric_cols}
In [252]:
for col in numeric_cols:
    _, upper = thresholds[col]
    gtd_df = gtd_df.withColumn(col, F.when(F.col(col) > upper, upper).otherwise(F.col(col)))
In [12]:
# Select only killed, wounded, casualties
gtd_df_pd = gtd_df.select("killed", "wounded", "casualties").toPandas()

# Melt for Plotly
melted_df = gtd_df_pd.melt(var_name="Metric", value_name="Count")

# Box plot
fig = px.box(
    melted_df,
    x="Metric",
    y="Count",
    title="Distribution of Killed, Wounded, and Casualties",
    color="Metric",
    points="outliers"  # show outliers
)
fig.show()
In [254]:
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape: ({num_rows}, {num_cols})")
Shape: (172141, 21)

Preprocessing for Categorical Variables¶

Then, I performed preprocessing for categorical variables to enhance consistency and prepare the data for analysis. Text fields often contain inconsistencies due to varying cases, spacing, or formatting, which can lead to redundant categories and reduce model performance. To address this, the summary column was standardized by converting all text to lowercase using the str.lower() function. Additionally, for key categorical columns such as region, country, province, attack_type, target_type, weapon_type, and terror_group, I applied both str.lower() and str.strip() to ensure that values are case-insensitive and free from leading or trailing whitespaces. This step was essential to avoid duplicate labels and to maintain uniformity across the dataset, thereby improving the reliability of downstream analyses like encoding, grouping, and classification.

In [255]:
# -----------------------------
# 1. Standardize 'summary' column
# -----------------------------
gtd_df = gtd_df.withColumn("summary", F.lower(F.col("summary")))
In [256]:
# -----------------------------
# 2. Standardize multiple text columns
# -----------------------------
text_cols = ['region', 'country', 'province', 'attack_type', 'target_type',
             'weapon_type', 'terror_group']

for col in text_cols:
    gtd_df = gtd_df.withColumn(col, F.lower(F.col(col))) \
                   .withColumn(col, F.trim(F.col(col)))
In [257]:
# -----------------------------
# 3. Check updated shape and preview
# -----------------------------
num_rows = gtd_df.count()
num_cols = len(gtd_df.columns)
print(f"Shape: ({num_rows}, {num_cols})")

gtd_df.show(5, truncate=False)
Shape: (172141, 21)
+----+-----+---+--------------------------+--------------+----------------+---------+-----------+-------+-----------------+------------------------------+----------------------------------------------------+-----------+----------------------------+-------+------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+----------+
|year|month|day|region                    |country       |province        |latitude |longitude  |success|attack_type      |target_type                   |target                                              |weapon_type|terror_group                |suicide|killed|wounded|summary                                                                                                                                                                                                                                                                                       |dbsource                          |date      |casualties|
+----+-----+---+--------------------------+--------------+----------------+---------+-----------+-------+-----------------+------------------------------+----------------------------------------------------+-----------+----------------------------+-------+------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+----------+
|1970|2    |28 |middle east & north africa|jordan        |khalil          |31.530243|35.094162  |1.0    |armed assault    |tourists                      |Tourist Bus                                         |firearms   |unknown                     |0.0    |0.0   |0.0    |unknown                                                                                                                                                                                                                                                                                       |PGIS                              |1970-02-28|0.0       |
|1970|5    |28 |north america             |united states |arizona         |33.44826 |-112.075774|0.0    |bombing/explosion|government (general)          |U.S. Department of Labor bus, Phoenix Arizona       |explosives |left-wing militants         |0.0    |0.0   |0.0    |5/28/1970: unknown perpetrators connected a bomb consisting of tnt to the engine of a u.s. department of labor vehicle in phoenix, arizona, united states that transported people to a government training program.  the bomb did not explode because it was believed to be wired incorrectly.|"" U.S. Government Printing Office|1970-05-28|0.0       |
|1970|6    |27 |western europe            |united kingdom|northern ireland|54.607712|-5.95621   |1.0    |armed assault    |religious figures/institutions|St. Matthew                                         |firearms   |ulster volunteer force (uvf)|0.0    |3.0   |1.0    |unknown                                                                                                                                                                                                                                                                                       |CAIN                              |1970-06-27|4.0       |
|1970|7    |7  |north america             |united states |new york        |40.697132|-73.931351 |1.0    |bombing/explosion|business                      |Portuguese Travel/Info center                       |explosives |unknown                     |0.0    |0.0   |0.0    |unknown                                                                                                                                                                                                                                                                                       |PGIS                              |1970-07-07|0.0       |
|1970|7    |23 |north america             |united states |california      |34.097866|-118.407379|1.0    |bombing/explosion|government (general)          |California Highway Patrol office, Oakland California|explosives |left-wing militants         |0.0    |0.0   |0.0    |7/23/1970: unknown perpetrators, located on the grove-shafter freeway, threw a bomb at the california highway patrol office in oakland, california, united states.  the bomb landed fifty feet away from the building and created a small hole in the ground.                                 | 1970."                           |1970-07-23|0.0       |
+----+-----+---+--------------------------+--------------+----------------+---------+-----------+-------+-----------------+------------------------------+----------------------------------------------------+-----------+----------------------------+-------+------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------+----------+
only showing top 5 rows

Summary Statistics¶

To begin the descriptive analysis, I performed summary statistics on key numerical variables, including killed, wounded, casualties, suicide, year, latitude, and longitude. The initial descriptive table provided insights into the central tendency and spread of the data. For instance, both killed and wounded had a mean slightly above 1, with medians of 0, indicating a highly right-skewed distribution, where most incidents result in no casualties, but a few extreme cases contribute to higher averages. The variables Killed, Wounded, and Casualties have mean values of 1.19, 1.13, and 2.68 respectively, with standard deviations of 1.68, 1.81, and 3.43, indicating that while most incidents involve a small number of victims, there is variability in the impact of attacks. The suicide variable also reflected a strong skew, with a mean of 0.038, showing that suicide attacks are relatively rare. Variance and standard deviation metrics showed a considerable spread in casualty figures. Further statistical measures like mode, skewness, and kurtosis confirmed the presence of asymmetry and heavy tails in variables such as suicide and longitude.

In [185]:
# -----------------------------
# 1. Summary for selected numerical variables
# -----------------------------
numerical_vars = ['killed', 'wounded', 'casualties', 'suicide', 'year', 'latitude', 'longitude']
summary_selected = gtd_df.select(numerical_vars).describe()
summary_selected.show(truncate=False)
+-------+------------------+------------------+------------------+-------+------------------+------------------+------------------+
|summary|killed            |wounded           |casualties        |suicide|year              |latitude          |longitude         |
+-------+------------------+------------------+------------------+-------+------------------+------------------+------------------+
|count  |172141            |172141            |171927            |172030 |172141            |168169            |168168            |
|mean   |1.1851156900447888|1.1267042714983646|2.6842729763213455|0.0    |2003.0583358990596|23.780424424403982|-483.008697948218 |
|stddev |1.6796319857787025|1.8117990014862626|3.4342194295821695|0.0    |13.187466057351415|18.295846863438953|210167.07786337263|
|min    |-9.0              |0.0               |-4.0              |0.0    |1970              |-53.154613        |-8.6185896E7      |
|max    |5.0               |5.0               |10.0              |0.0    |2017              |74.633553         |179.366667        |
+-------+------------------+------------------+------------------+-------+------------------+------------------+------------------+

In [186]:
# -----------------------------
# 2. Median using approxQuantile
# -----------------------------
median_values = {}
for col in numerical_vars:
    median_values[col] = gtd_df.approxQuantile(col, [0.5], 0.0)[0]

print("Median values:")
print(median_values)

# -----------------------------
# 4. Mode (most frequent value)
# -----------------------------
mode_values = {}
for col in numerical_vars:
    mode_row = gtd_df.groupBy(col).count().orderBy(F.desc("count")).first()
    mode_values[col] = mode_row[0] if mode_row else None

print("Mode values:")
print(mode_values)

# -----------------------------
# 5. Variance, Skewness, Kurtosis
# -----------------------------
stats_df = gtd_df.select(
    [F.variance(col).alias(f"{col}_var") for col in numerical_vars] +
    [F.skewness(col).alias(f"{col}_skew") for col in numerical_vars] +
    [F.kurtosis(col).alias(f"{col}_kurt") for col in numerical_vars]
)
stats_df.show(truncate=False)
Median values:
{'killed': 0.0, 'wounded': 0.0, 'casualties': 1.0, 'suicide': 0.0, 'year': 2009.0, 'latitude': 31.5282, 'longitude': 43.526192}
Mode values:
{'killed': 0.0, 'wounded': 0.0, 'casualties': 0.0, 'suicide': 0.0, 'year': 2014, 'latitude': 33.303566, 'longitude': 44.371773}
+------------------+------------------+-----------------+-----------+------------------+----------------+--------------------+------------------+-----------------+-----------------+------------+-------------------+-------------------+------------------+-------------------+-------------------+---------------------+------------+-------------------+------------------+-----------------+
|killed_var        |wounded_var       |casualties_var   |suicide_var|year_var          |latitude_var    |longitude_var       |killed_skew       |wounded_skew     |casualties_skew  |suicide_skew|year_skew          |latitude_skew      |longitude_skew    |killed_kurt        |wounded_kurt       |casualties_kurt      |suicide_kurt|year_kurt          |latitude_kurt     |longitude_kurt   |
+------------------+------------------+-----------------+-----------+------------------+----------------+--------------------+------------------+-----------------+-----------------+------------+-------------------+-------------------+------------------+-------------------+-------------------+---------------------+------------+-------------------+------------------+-----------------+
|2.8211636076509077|3.2826156217866185|11.79386309051968|0.0        |173.90926101379569|334.738012450409|4.417020061762894E10|1.3428747652791244|1.338641369146701|1.192927082842646|NULL        |-0.6952796439100506|-0.9774632615990559|-410.0792155185672|0.44628601459789685|0.16546045782746477|-0.014236486342569243|NULL        |-0.9310982234231573|0.9576496778248913|168162.9753356737|
+------------------+------------------+-----------------+-----------+------------------+----------------+--------------------+------------------+-----------------+-----------------+------------+-------------------+-------------------+------------------+-------------------+-------------------+---------------------+------------+-------------------+------------------+-----------------+

Exploratory Data Analysis and Visualizations¶

Then I performed a comprehensive Exploratory Data Analysis.

For the exploratory data analysis (EDA) part, I am using pandas instead of Spark because I want to leverage Plotly for visualizations. While Spark is excellent for processing and analyzing large distributed datasets efficiently, its native visualization capabilities are limited. Plotly, on the other hand, provides interactive and highly customizable plots, which are very useful during EDA to explore distributions, trends, correlations, and outliers. By converting the Spark DataFrame to a pandas DataFrame, I can take advantage of Plotly’s interactive charts while still working with the cleaned and preprocessed data. This approach combines the scalability of Spark for data processing with the visual power of Plotly for insights.

In [82]:
# -----------------------------
# Convert Spark DataFrame to pandas
# -----------------------------
# Overwrite gtd_df with its pandas version
gtd_df = gtd_df.toPandas()

# Preview the first few rows
gtd_df.head()
Out[82]:
year month day region country province latitude longitude success attack_type ... target weapon_type terror_group suicide killed wounded summary dbsource date casualties
0 1970 2 28 middle east & north africa jordan khalil 31.530243 35.094162 1.0 armed assault ... Tourist Bus firearms unknown 0.0 0.0 0.0 unknown PGIS 1970-02-28 0.0
1 1970 5 28 north america united states arizona 33.448260 -112.075774 0.0 bombing/explosion ... U.S. Department of Labor bus, Phoenix Arizona explosives left-wing militants 0.0 0.0 0.0 5/28/1970: unknown perpetrators connected a bo... "" U.S. Government Printing Office 1970-05-28 0.0
2 1970 6 27 western europe united kingdom northern ireland 54.607712 -5.956210 1.0 armed assault ... St. Matthew firearms ulster volunteer force (uvf) 0.0 3.0 1.0 unknown CAIN 1970-06-27 4.0
3 1970 7 7 north america united states new york 40.697132 -73.931351 1.0 bombing/explosion ... Portuguese Travel/Info center explosives unknown 0.0 0.0 0.0 unknown PGIS 1970-07-07 0.0
4 1970 7 23 north america united states california 34.097866 -118.407379 1.0 bombing/explosion ... California Highway Patrol office, Oakland Cali... explosives left-wing militants 0.0 0.0 0.0 7/23/1970: unknown perpetrators, located on th... 1970." 1970-07-23 0.0

5 rows × 21 columns

For the Exploratory Data Analysis (EDA), the first step involved examining the correlation matrix among key numerical variables such as year, month, day, latitude, longitude, success, suicide, killed, wounded, and casualties. The correlation analysis revealed several noteworthy relationships: for instance, killed and wounded showed a strong positive correlation with casualties (0.764 and 0.805, respectively), as expected due to their additive nature. Additionally, suicide incidents were positively correlated with both killed (0.267) and wounded (0.209), indicating that suicide attacks tend to result in higher casualties. There were also moderate correlations between success and variables like killed (0.163), wounded (0.140), and casualties (0.199), suggesting that successful attacks generally lead to greater human impact. On the other hand, temporal variables such as year, month, and day showed negligible correlation with attack severity, implying that casualties are not directly associated with the calendar date. Overall, the correlation matrix helped identify variables with strong linear relationships, offering valuable insights for feature selection and further analysis.

In [ ]:
#correlation heatmap
plt.subplots(figsize=(10,6))
sns.heatmap(gtd_df.corr(),annot = False)
plt.show()
In [17]:
# The number of terrorist cases VS year
# Count the number of cases per year
year_counts = gtd_df['year'].value_counts().sort_index()

# Create a DataFrame from the counts for plotting
year_counts_df = year_counts.reset_index()
year_counts_df.columns = ['year', 'cases']

# Plotly bar chart
fig = px.bar(
    year_counts_df,
    x='year',
    y='cases',
    title='Year by year terrorist cases',
    labels={'year': 'Year', 'cases': 'Number of cases'},
    template='plotly_dark',
    color='cases',
    color_continuous_scale='reds_r'
)

fig.update_layout(xaxis_tickangle=90, coloraxis_showscale=False)
fig.show()

The first bar plot shows the number of terrorist incidents recorded each year. From the early 1970s to around 2004, the number of cases remained relatively low and stable. However, post-2004, there is a sharp increase in terrorism incidents, peaking between 2014 and 2015. This trend highlights a significant rise in global terrorist activities during the 2010s.

In [18]:
#using groupby to get a table comparing the trend in cases and kills by year
case_kill_df = pd.DataFrame(gtd_df.groupby(['year'],as_index = True).killed.agg(['count','sum']))
case_kill_df.rename(columns = {'count':'total cases','sum':'total killed'},inplace = True)
case_kill_df.reset_index(inplace = True)
In [19]:
fig = go.Figure()

# Bar trace for total cases
fig.add_trace(go.Bar(
    x=case_kill_df['year'],
    y=case_kill_df['total cases'],
    name='Total Cases',
    marker_color='rgb(27,158,119)'  # example color from 'Dark2' palette
))

# Line trace for total killed with secondary y-axis
fig.add_trace(go.Scatter(
    x=case_kill_df['year'],
    y=case_kill_df['total killed'],
    name='Total Killed',
    mode='lines+markers',
    marker=dict(color='black'),
    yaxis='y2'
))

# Create axis objects
fig.update_layout(
    title='Comparison of total number of cases and kills by year',
    xaxis=dict(title='Year', tickangle=90),
    yaxis=dict(
        title='Total Cases',
        showgrid=False
    ),
    yaxis2=dict(
        title='Total Killed',
        overlaying='y',
        side='right',
        showgrid=False
    ),
    legend=dict(x=0.1, y=1.1, orientation='h'),
    template='plotly_white',
    width=900,
    height=500
)

fig.show()

This plot combines bar and line graphs to compare the total number of terrorism-related incidents and total fatalities over time. While the number of cases increased sharply after 2004, the number of deaths also rose dramatically, especially around 2014, suggesting not only more frequent attacks but also deadlier ones during that period. This helps underline the growing impact and severity of terrorism in recent years.

In [20]:
# Countplot of Total Number of Cases by Region (Horizontal Bar)

# Prepare data
region_counts = gtd_df['region'].value_counts().reset_index()
region_counts.columns = ['region', 'count']

# Plotly bar chart
fig = px.bar(
    region_counts,
    x='count',
    y='region',
    orientation='h',
    title='Total Cases by Region',
    labels={'count': 'Number of Cases', 'region': 'Region'},
    color='count',
    color_continuous_scale='magma'
)
fig.update_layout(yaxis=dict(categoryorder='total ascending'))  # To match Seaborn's order
fig.show()

The third horizontal bar chart visualizes the regional distribution of terrorist cases. The Middle East & North Africa and South Asia are shown to be the most affected regions, each with over 40,000 recorded cases. Sub-Saharan Africa and South America follow with significantly lower but still notable numbers. This regional breakdown highlights how certain parts of the world are disproportionately impacted by terrorism.

In [21]:
# Pie Chart – Percentage Distribution of Cases by Region
# Prepare data
region_counts = gtd_df['region'].value_counts().reset_index()
region_counts.columns = ['region', 'count']

# Optional: truncate pull values to match the number of regions (automatically)
pull_vals = [0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.65, 0.8, 1.0]  # Max is 1.0
pull_vals = pull_vals[:len(region_counts)]  # Avoid index error

# Plotly Pie Chart
fig = px.pie(
    region_counts,
    names='region',
    values='count',
    title='Percentage Distribution of Total Cases by Region',
    hole=0.0
)

fig.update_traces(
    textinfo='percent+label',
    pull=pull_vals
)

fig.show()

The pie chart # illustrates how terrorist incidents are distributed across different regions. The Middle East & North Africa account for the highest proportion at 28.3%, followed closely by South Asia at 25.3%. Sub-Saharan Africa, South America, and Western Europe also contribute significantly, with percentages ranging from 8.82% to 9.93%. In contrast, regions like Australasia & Oceania, Central Asia, and East Asia have minimal contributions, each below 1%. This distribution highlights the concentration of terrorist activities in specific geopolitical areas, likely influenced by factors such as political instability, conflict, and socio-economic conditions.

In [22]:
# Prepare crosstab data
region_year = pd.crosstab(gtd_df['year'], gtd_df['region'])

# Reset index to turn 'year' into a column for Plotly
region_year = region_year.reset_index()

# Melt the dataframe into long format for Plotly
region_year_melted = region_year.melt(id_vars='year', var_name='region', value_name='cases')

# Plotly area chart
fig = px.area(
    region_year_melted,
    x='year',
    y='cases',
    color='region',
    title='Trend in Terrorist Activities by Region (Yearly)',
    labels={'cases': 'Number of Attacks', 'year': 'Year'},
    template='plotly_white'
)

fig.update_layout(
    legend_title_text='Region',
    xaxis=dict(dtick=1, tickangle=0),
    yaxis=dict(title='Number of Attacks')
)

fig.show()

The graph showing the "Trend in Terrorist Activities by Region (Yearly)" depicts the yearly number of terrorist attacks across regions over time. Regions like the Middle East & North Africa and South Asia probably show higher numbers, consistent with their dominance in the percentage distribution. The trend may indicate periods of escalation or decline, possibly correlating with geopolitical events, counter-terrorism efforts, or regional conflicts. Such trends are critical for understanding the dynamic nature of terrorism globally.

In [23]:
# Grouping and sorting casualties by region
kills_by_region = gtd_df.groupby('region')['casualties'].sum().sort_values(ascending=False)

# Converting to DataFrame for Plotly
kills_by_region_df = kills_by_region.reset_index()

# Plotly horizontal bar chart
fig = px.bar(
    kills_by_region_df,
    x='casualties',
    y='region',
    orientation='h',
    color='casualties',
    color_continuous_scale='RdYlGn_r',
    title='Casualties by Region',
    labels={'casualties': 'Number of Casualties', 'region': 'Region'},
    template='plotly_white'
)

fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    coloraxis_colorbar=dict(title='Casualties')
)

fig.show()

The bar graph "Casualties by Region" ranks regions by the total number of casualties (deaths and injuries) caused by terrorist attacks. The Middle East & North Africa and South Asia likely lead with the highest casualties, reflecting their high attack frequencies and the severity of incidents. Sub-Saharan Africa and South America #follow, with Southeast Asia and Central America & the Caribbean also contributing notable numbers. Western Europe, Eastern Europe, and North America, while having fewer casualties, still show measurable impacts. This graph underscores the human cost of terrorism, with the most affected regions bearing the brunt of fatalities and injuries, often due to large-scale or sustained conflicts.

In [24]:
#Trend in total casualties vs region from 2007-2017
# Prepare the data (last 11 years: 2007–2017)
casualty_trend = pd.crosstab(
    index=gtd_df['year'],
    columns=gtd_df['region'],
    values=gtd_df['casualties'],
    aggfunc=np.sum
).fillna(0)[-11:]

# Create Plotly figure
fig = go.Figure()

# Add a line for each region
for region in casualty_trend.columns:
    fig.add_trace(go.Scatter(
        x=casualty_trend.index,
        y=casualty_trend[region],
        mode='lines+markers',
        name=region
    ))

# Update layout
fig.update_layout(
    title='Trend in Total Casualties by Region (2007–2017)',
    xaxis_title='Year',
    yaxis_title='Total Casualties',
    template='plotly_white',
    legend_title='Region',
    hovermode='x unified',
    width=1000,
    height=500
)

fig.show()

This line graph tracks the total casualties (deaths and injuries) from terrorist attacks across different regions from 2007 to 2017. The Middle East & North Africa and South Asia likely show the highest peaks, reflecting ongoing conflicts and instability. Sub-Saharan Africa and South America may also exhibit rising trends, while regions like Western Europe and North America remain relatively low but may show sporadic spikes due to isolated high-casualty attacks. The graph helps identify periods of escalation, such as post-2011 during the Arab Spring or the rise of ISIS, and highlights regions requiring urgent counter-terrorism measures.

In [25]:
#Top 10 Countries with the Highest Number of Cases

# Prepare data
top_10_countries = gtd_df['country'].value_counts().nlargest(10).reset_index()
top_10_countries.columns = ['country', 'total_cases']

# Create plot
fig = px.bar(
    top_10_countries,
    x='country',
    y='total_cases',
    title='Top 10 Countries with the Highest Number of Cases',
    labels={'country': 'Country', 'total_cases': 'Total Cases'},
    color='total_cases',
    color_continuous_scale='Reds'
)

# Update layout
fig.update_layout(
    xaxis_tickangle=60,
    template='plotly_white',
    width=900,
    height=500
)

fig.show()
In [26]:
# Prepare top 10 countries by casualties
cas_country = (
    gtd_df.groupby('country')['casualties']
    .sum()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={'casualties': 'total_casualties'})
)

top_10_casualties = cas_country.head(10)

# Create Plotly bar chart
fig = px.bar(
    top_10_casualties,
    x='country',
    y='total_casualties',
    title='Top 10 Countries with the Highest Number of Casualties',
    labels={'country': 'Country', 'total_casualties': 'Total Casualties'},
    color='total_casualties',
    color_continuous_scale='Reds'
)

# Update layout
fig.update_layout(
    xaxis_tickangle=60,
    template='plotly_white',
    width=900,
    height=500
)

fig.show()

The above 2 bar charts ranks countries by their total recorded terrorist incidents as well as causalities. Iraq, Afghanistan, and Pakistan likely dominate due to prolonged insurgencies and extremist activity. India, the Philippines, and Nigeria may also appear, linked to separatist movements and jihadist groups. The presence of the UK or other Western nations could reflect domestic extremism or high-profile attacks. The visualization underscores how terrorism is concentrated in specific nations, often tied to governance failures, ethnic conflicts, or foreign interventions.

In [28]:
import pandas as pd
import plotly.express as px

# Prepare data
attack_counts = (
    gtd_df['attack_type']  # make sure this column exists
    .value_counts()
    .reset_index()
    .rename(columns={'index': 'attack_type', 'count': 'num_cases'})  # 'count' not 'attack_type'
)

# Create Plotly bar chart
fig = px.bar(
    attack_counts,
    x='num_cases',
    y='attack_type',
    orientation='h',
    title='Number of Cases by Attack Type',
    labels={'num_cases': 'Number of Cases', 'attack_type': 'Type of Attack'},
    color='num_cases',
    color_continuous_scale='Magma'
)

# Update layout
fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    template='plotly_white',
    width=900,
    height=500
)

fig.show()

This bar graph categorizes terrorist incidents by attack methodology. Bombings/explosions and armed assaults are likely the most frequent, given their lethality and ease of execution. Assassinations and hostage-takings (kidnappings) follow, while hijackings and barricade incidents are rarer. The dominance of bombings aligns with global trends, as they maximize psychological impact and media attention. Understanding these patterns helps security agencies prioritize preventive measures, such as explosive detection or counter-assault training.

In [29]:
attack_region = gtd_df.groupby(['region', 'attack_type']).size().reset_index(name='cases')

fig = px.bar(
    attack_region,
    x='cases',
    y='region',
    color='attack_type',
    orientation='h',
    title='Attack Type vs Region',
    labels={'cases': 'Number of Cases', 'region': 'Region', 'attack_type': 'Attack Type'},
    color_discrete_sequence=px.colors.qualitative.Safe
)

fig.update_layout(
    barmode='stack',
    template='plotly_white',
    height=600,
    width=1000
)

fig.show()

This stacked bar chart examines how attack methods vary by region. The Middle East & South Asia may show heavy use of bombings and armed assaults, typical of insurgencies. Sub-Saharan Africa could see more armed assaults and kidnappings, reflecting guerrilla tactics. In contrast, Western Europe and North America may have fewer but more diverse attacks, including unarmed assaults or facility attacks, often tied to lone-actor extremism. This regional breakdown aids in tailoring counter-terrorism strategies—for example, bomb-disposal units in conflict zones versus surveillance in urban areas.

In [30]:
# Prepare the data
attack_yr_cas = pd.crosstab(index=gtd_df['attack_type'], columns=gtd_df['year'], values=gtd_df['casualties'], aggfunc='sum')

# Convert to long format for Plotly
attack_yr_cas_reset = attack_yr_cas.reset_index().melt(id_vars='attack_type', var_name='year', value_name='casualties')

# Create heatmap
fig = px.density_heatmap(
    attack_yr_cas_reset,
    x='year',
    y='attack_type',
    z='casualties',
    color_continuous_scale='Cividis_r',
    title='Casualties by Attack Type and Year',
    labels={'casualties': 'Total Casualties'}
)

fig.update_layout(height=600, width=1000)
fig.show()

This heatmap visualizes the relationship between attack types and casualties over time (1970–2010). Darker shades indicate higher casualties, revealing trends such as the dominance of bombings/explosions and armed assaults in causing mass casualties, particularly during peak conflict years (e.g., 2000s in Iraq/Afghanistan). Lighter years for tactics like hijackings or assassinations suggest their lower lethality or declining use. The heatmap highlights how specific attack methods drive surges in violence, correlating with geopolitical events like the rise of jihadist groups post-9/11.

In [31]:
# Prepare the weapon data
gtd_weapon = gtd_df.groupby(['weapon_type'])['casualties'].agg(['count', 'sum']).reset_index()
gtd_weapon.rename(columns={'count': 'total_cases', 'sum': 'total_casualties'}, inplace=True)

# Exclude 'Unknown'
z_2 = gtd_weapon[gtd_weapon['weapon_type'] != 'Unknown']

# Sort by casualties
z_2 = z_2.sort_values('total_casualties', ascending=False)

# Create bar chart with valid colorscale
fig = px.bar(
    z_2,
    x='total_casualties',
    y='weapon_type',
    orientation='h',
    title='Weapon Type vs Total Casualties (Excl. Unknown)',
    labels={'total_casualties': 'Total Casualties', 'weapon_type': 'Weapon Type'},
    color='total_casualties',
    color_continuous_scale='magma'  # valid Plotly colorscale
)

fig.update_layout(height=600, width=1000)
fig.show()

This bar chart ranks weapon types by their associated casualties. Explosives and firearms dominate, reflecting their widespread use in attacks like bombings and shootings. Incendiary weapons and melee attacks (e.g., knives) follow, while chemical/biological weapons appear minimal due to their rarity. The stark contrast underscores how conventional weapons remain terrorists' primary tools due to accessibility and destructive potential, guiding security focus on explosive detection and arms control.

In [32]:
# Get top 15 target types by count
top_targets = gtd_df['target_type'].value_counts().nlargest(15).index

# Filter dataframe for those target types
filtered_df = gtd_df[gtd_df['target_type'].isin(top_targets)]

# Create count data (like seaborn countplot)
target_counts = filtered_df['target_type'].value_counts().reset_index()
target_counts.columns = ['target_type', 'count']

# Create horizontal bar chart in Plotly
fig = px.bar(
    target_counts.sort_values('count'),
    x='count',
    y='target_type',
    orientation='h',
    title='Number of Cases by Target Type (Top 15)',
    labels={'count': 'Number of Cases', 'target_type': 'Target Type'},
    color='count',
    color_continuous_scale='magma'
)

fig.update_layout(height=600, width=900)
fig.show()

Private citizens, military, and police are the most frequent targets, emphasizing terrorists' aim to instill fear and challenge state authority. Attacks on transportation (e.g., airports) and religious institutions highlight symbolic value, while utilities and media reflect disruption goals. This breakdown aids in prioritizing protection for high-risk sectors like public spaces and critical infrastructure.

In [33]:
# Prepare the crosstab data
yr_target_cas = pd.crosstab(
    index=gtd_df['target_type'],
    columns=gtd_df['year'],
    values=gtd_df['casualties'],
    aggfunc=np.sum
).fillna(0)

# Create heatmap
fig = go.Figure(data=go.Heatmap(
    z=yr_target_cas.values,
    x=yr_target_cas.columns.astype(str),
    y=yr_target_cas.index,
    colorscale='YlOrRd',
    colorbar=dict(title='Casualties'),
    hovertemplate='Year: %{x}<br>Target Type: %{y}<br>Casualties: %{z}<extra></extra>'
))

fig.update_layout(
    title='Trend in Target Type by Year',
    xaxis_title='Year',
    yaxis_title='Target Type',
    height=700,
    width=900
)

fig.show()

This heatmap tracks how target preferences evolve annually. Military and police likely show consistent targeting, while spikes in attacks on private citizens or religious sites may align with sectarian violence (e.g., Iraq’s civil war). Shifts toward "soft" targets (e.g., tourists, schools) in later years could indicate counter-terrorism pressures forcing adaption. The trend reveals strategic shifts in terrorist tactics over decades.

In [34]:
# Get top 20 terrorist groups (excluding first if it is unknown or NaN)
top_groups = gtd_df['terror_group'].value_counts().iloc[1:21].index

# Filter dataframe to include only those groups
filtered_df = gtd_df[gtd_df['terror_group'].isin(top_groups)]

# Prepare count data
group_counts = filtered_df['terror_group'].value_counts().reset_index()
group_counts.columns = ['terror_group', 'count']

# Create horizontal bar chart sorted ascending for better visual
fig = px.bar(
    group_counts.sort_values('count'),
    x='count',
    y='terror_group',
    orientation='h',
    title='Top 20 Terrorist Organizations vs Number of Cases',
    labels={'count': 'Number of Cases', 'terror_group': 'Group Name'},
    color='count',
    color_continuous_scale='magma'
)

fig.update_layout(height=600, width=900)
fig.show()

The Taliban, ISIS, and Boko Haram lead, reflecting their operational scale in conflict zones like Afghanistan and Nigeria. Leftist groups (e.g., FARC, Shining Path) and separatists (e.g., PKK, LTTE) also appear, tied to historical insurgencies. The data underscores how a few groups drive global terrorism, with ideology (jihadism, communism) and regional grievances shaping their prevalence. This informs counter-terrorism prioritization of high-threat entities.

In [35]:
# Group by terrorist group and region, sum casualties, sort descending
gtd_x = gtd_df.groupby(['terror_group','region']).casualties.agg(['sum']).sort_values('sum',ascending=False)
gtd_x.reset_index(inplace=True)
gtd_x.rename(columns={'sum':'total casualties'}, inplace=True)

# Exclude 'Unknown' terrorist groups and take top 15 rows
z = gtd_x.loc[gtd_x['terror_group'] != 'Unknown'][:15]

# Create normalized crosstab (proportion of total casualties)
cross3 = pd.crosstab(
    index=z['terror_group'],
    columns=z['region'],
    values=z['total casualties'],
    aggfunc=np.sum,
    normalize='all'
).fillna(0)

# Plot Plotly heatmap
fig = go.Figure(data=go.Heatmap(
    z=cross3.values,
    x=cross3.columns,
    y=cross3.index,
    colorscale='Viridis',
    colorbar=dict(title='Proportion of Total Casualties'),
    hovertemplate='Terror Group: %{y}<br>Region: %{x}<br>Proportion: %{z:.2%}<extra></extra>'
))

fig.update_layout(
    title='Casualties by Terrorist Groups vs Regions',
    xaxis_title='Region',
    yaxis_title='Terrorist Group',
    height=600,
    width=900
)

fig.show()

This visualization illustrates the disproportionate impact of specific terrorist groups across different regions. The Taliban and ISIS (Islamic State of Iraq and the Levant) dominate in South Asia and the Middle East & North Africa, respectively, accounting for the highest casualties. Boko Haram’s stronghold in Sub-Saharan Africa and the Liberation Tigers of Tamil Eelam’s (LTTE) historical impact in South Asia are also evident. This plot underscores how regional instability and ideological movements shape the operational focus and lethality of terrorist organizations.

In [36]:
fig = px.histogram(
    gtd_df,
    x='year',
    color='success',
    barmode='group',
    title='Trend Of Successful and Unsuccessful Attacks from 1970-2017',
    labels={'year': 'Attack Year', 'count': 'Number of Attacks', 'success': 'Success'},
    color_discrete_map={0: 'red', 1: 'green'}  # assuming success is binary 0/1
)

fig.update_layout(
    xaxis=dict(tickangle=90),
    height=500,
    width=900
)

fig.show()

The line graph tracks the success rate of terrorist attacks over time, revealing fluctuations linked to counter-terrorism efforts and group capabilities. Peaks in the 1990s and post-2000s correlate with the rise of groups like Al-Qaeda and ISIS, while dips may reflect improved security measures. Unsuccessful attacks (e.g., thwarted plots) are notably lower, emphasizing terrorists’ persistence despite interventions. This trend highlights the evolving "cat-and-mouse" dynamic between terrorists and security forces.

In [37]:
# Creating a new dataframe to compare suicide cases and killed
new_dataset = gtd_df[['year','suicide','killed']].copy()
new_dataset = new_dataset[new_dataset['suicide'] != 0]  # filter where suicide != 0

y = new_dataset.groupby('year', as_index=False).agg({'suicide':'sum', 'killed':'sum'})
y.rename(columns={'suicide':'cases'}, inplace=True)

# Now your plotly code
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Bar(
    x=y['year'],
    y=y['cases'],
    name='Suicide Cases',
    marker_color='orangered',
    yaxis='y1'
))

fig.add_trace(go.Scatter(
    x=y['year'],
    y=y['killed'],
    name='Killed',
    mode='lines+markers',
    marker=dict(color='black'),
    yaxis='y2'
))

fig.update_layout(
    title='Suicide Cases vs Deaths in Suicide Cases by Year',
    xaxis=dict(title='Year', tickangle=90),
    yaxis=dict(title='Suicide Cases', side='left', showgrid=False),
    yaxis2=dict(title='Killed', overlaying='y', side='right', showgrid=False),
    legend=dict(x=0.1, y=1.1, orientation='h'),
    width=900,
    height=500,
    template='plotly_white'
)

fig.show()

This dual-axis chart compares the frequency of suicide attacks to their lethality (deaths per attack). Post-2000, suicide bombings surge, particularly in the Middle East and South Asia, with ISIS and the Taliban maximizing casualties through coordinated strikes. The parallel rise in deaths per attack suggests tactical refinement, such as vehicle-borne explosives targeting crowds. This grim trend underscores suicide terrorism’s role as a high-impact strategy for instilling terror.

In [38]:
#groupby casualties and kills specific to terrorist groups, regions and countries
gtd_y = pd.DataFrame(gtd_df.groupby(['terror_group','country','region']).casualties.agg(['count','sum']).sort_values('sum',ascending = False))
gtd_y.reset_index(inplace = True)
gtd_y.rename(columns = {'region':'country_in_region','count':'total_cases','sum':'total_casualities'},inplace = True)

#excluding the unknown terrorist groups
z_1 = gtd_y.loc[gtd_y['terror_group'] != 'Unknown'][:10]
z_1
Out[38]:
terror_group country country_in_region total_cases total_casualities
0 unknown iraq middle east & north africa 17283 76715.0
1 taliban afghanistan south asia 6697 30820.0
2 islamic state of iraq and the levant (isil) iraq middle east & north africa 3762 21754.0
3 unknown pakistan south asia 10110 21078.0
4 unknown afghanistan south asia 4625 13990.0
5 unknown india south asia 4050 9288.0
6 shining path (sl) peru south america 3291 8733.0
7 liberation tigers of tamil eelam (ltte) sri lanka south asia 1495 7302.0
8 boko haram nigeria sub-saharan africa 1485 6814.0
9 kurdistan workers' party (pkk) turkey middle east & north africa 2012 6693.0

The table quantifies the deadliest groups, with "Unknown" perpetrators in Iraq leading (76,715 casualties), followed by the Taliban (30,820) and ISIS (21,754). Regional patterns emerge: South Asia (Taliban, LTTE), the Middle East (ISIS, PKK), and Sub-Saharan Africa (Boko Haram). The Shining Path’s historical impact in Peru stands out in South America. This data reinforces that a handful of groups drive the majority of global terrorism’s human toll, necessitating targeted counter-strategies.

In [39]:
# #list of sources for database collection
gtd_df['dbsource'].value_counts()
Out[39]:
START Primary Collection                    75769
PGIS                                        56699
ISVG                                        17149
CETIS                                       16030
CAIN                                         1587
UMD Schmid 2012                              1155
Hewitt Project                               1001
UMD Algeria 2010-2012                         846
UMD South Africa                              441
UMD Sri Lanka 2011                            405
UMD Miscellaneous                             229
Anti-Abortion Project 2010                    186
Eco Project 2010                              147
UMD JTMM Nepal 2012                           104
HSI                                            95
Hyland                                         66
Hijacking DB                                   54
UMD Encyclopedia of World Terrorism 2012       48
CBRN Global Chronology                         46
Armenian Website                               39
State Department 1997 Document                 28
UMD Assassinations Project                     18
UMD Black Widows 2011                           7
Leuprecht Canadian Data                         6
Disorders and Terrorism Chronology              5
Sageman                                         3
Name: dbsource, dtype: int64

For Sri Lanka¶

My analysis specifically examines Sri Lanka's experience with terrorism, focusing on the Liberation Tigers of Tamil Eelam (LTTE) insurgency and other violent actors. The following visualizations reveal patterns in attacks, casualties, and tactics during decades of conflict:

In [40]:
import plotly.express as px

# Group by country and terror_group, sum killed
co_tg_ki = gtd_df.groupby(['country', 'terror_group'], as_index=False).killed.sum()

# Filter for Sri Lanka and exclude Unknown groups, top 15
sl_tg_ki = co_tg_ki[(co_tg_ki['country'] == 'sri lanka') & (co_tg_ki['terror_group'] != 'Unknown')].nlargest(15, 'killed')

# Plotly horizontal bar chart
fig = px.bar(
    sl_tg_ki,
    x='killed',
    y='terror_group',
    orientation='h',
    title='Number of Killed by Terrorist Groups in Sri Lanka',
    labels={'killed': 'Total Killed', 'terror_group': 'Terrorist Group'},
    color='killed',
    color_continuous_scale='magma'
)

fig.update_layout(yaxis={'categoryorder':'total ascending'}, height=600, width=900)
fig.show()

The bar chart highlights the LTTE as the deadliest group, responsible for thousands of deaths during Sri Lanka's civil war (1983–2009). Factions like the JVP (Marxist insurgents) also contributed significantly. The data reflects the multi-actor nature of Sri Lanka's conflicts, where ethnic strife (Tamil vs. Sinhalese) and ideological movements fueled violence.

In [41]:
#Trend of Terrorist Attacks Over Years in Sri Lanka

# Filter Sri Lanka data
sl_data = gtd_df[gtd_df['country'] == 'sri lanka']

# Count number of attacks per year
attacks_per_year = sl_data.groupby('year').size().reset_index(name='num_attacks')

fig = px.line(
    attacks_per_year,
    x='year',
    y='num_attacks',
    title='Trend of Terrorist Attacks in Sri Lanka Over the Years',
    labels={'year': 'Year', 'num_attacks': 'Number of Attacks'},
    markers=True
)

fig.update_layout(height=500, width=900)
fig.show()

The line graph shows attack frequency peaking in the 1990s–2000s, coinciding with the LTTE's height of power. Post-2009, attacks plummet after the government's military victory, demonstrating how counter-insurgency can disrupt long-term terrorism trends. Spikes in the 1980s align with JVP uprisings, underscoring cyclical violence.

In [42]:
#Attack Types Distribution in Sri Lanka
attack_type_counts = sl_data['attack_type'].value_counts().reset_index()
attack_type_counts.columns = ['attack_type', 'count']

fig = px.bar(
    attack_type_counts,
    x='count',
    y='attack_type',
    orientation='h',
    title='Attack Types in Sri Lanka',
    labels={'count': 'Number of Attacks', 'attack_type': 'Attack Type'},
    color='count',
    color_continuous_scale='magma'
)

fig.update_layout(height=600, width=900, yaxis={'categoryorder':'total ascending'})
fig.show()

Bombings/explosions dominate, reflecting the LTTE's signature tactics (e.g., suicide bombings, truck bombs). Armed assaults and assassinations follow, targeting officials and civilians. The rarity of hijackings and barricade incidents suggests a focus on asymmetric warfare rather than complex sieges.

In [43]:
#Target Types Distribution in Sri Lanka
target_type_counts = sl_data['target_type'].value_counts().reset_index()
target_type_counts.columns = ['target_type', 'count']

fig = px.bar(
    target_type_counts,
    x='count',
    y='target_type',
    orientation='h',
    title='Target Types in Sri Lanka',
    labels={'count': 'Number of Attacks', 'target_type': 'Target Type'},
    color='count',
    color_continuous_scale='plasma'
)

fig.update_layout(height=600, width=900, yaxis={'categoryorder':'total ascending'})
fig.show()

Military and police were primary targets, aiming to weaken state control. Attacks on civilians (private property, transportation) and politicians reveal efforts to destabilize society. The LTTE's targeting of journalists (e.g., assassination of editors) highlights its suppression of dissent.

In [44]:
#Weapon Types Used in Sri Lanka (Excluding Unknown)
weapon_counts = sl_data[sl_data['weapon_type'].str.lower() != 'unknown']['weapon_type'].value_counts().reset_index()
weapon_counts.columns = ['weapon_type', 'count']

fig = px.bar(
    weapon_counts,
    x='count',
    y='weapon_type',
    orientation='h',
    title='Weapon Types Used in Sri Lanka',
    labels={'count': 'Number of Attacks', 'weapon_type': 'Weapon Type'},
    color='count',
    color_continuous_scale='viridis'
)

fig.update_layout(height=600, width=900, yaxis={'categoryorder':'total ascending'})
fig.show()

Explosives (e.g., suicide vests, IEDs) and firearms were most common, enabling mass-casualty attacks. Incendiary weapons (arson) and melee tools (knives) appear in smaller-scale assaults. The absence of WMDs aligns with the LTTE's conventional yet brutal methods.

In [45]:
# Successful vs Unsuccessful Attacks Over Years in Sri Lanka
success_counts = sl_data.groupby(['year', 'success']).size().reset_index(name='count')

fig = px.bar(
    success_counts,
    x='year',
    y='count',
    color=success_counts['success'].map({0:'Unsuccessful', 1:'Successful'}),
    labels={'count':'Number of Attacks', 'year':'Year', 'color':'Attack Success'},
    title='Successful vs Unsuccessful Attacks in Sri Lanka Over Years'
)

fig.update_layout(height=500, width=900, barmode='group')
fig.show()

Successful attacks surged during the civil war, with the LTTE executing high-profile bombings (e.g., Colombo Central Bank attack). Post-2009, unsuccessful plots rise briefly, possibly due to fragmented remnants or improved counter-terrorism.

Conclusion from the visualizations¶

In [145]:
print(f"""The dataset here consists of record from the year {gtd_df.year.min()} to {gtd_df.year.max()},taking into account the cases from {gtd_df.country.nunique()} countries from {gtd_df.region.nunique()} regions. Approximately {gtd_df.index.nunique()} terrorist \nattacks which caused about {int(gtd_df.casualties.sum())} casualties, comprising {int(gtd_df.killed.sum())} killed and {int(gtd_df.wounded.sum())} wounded are recorded in the dataset.""")
The dataset here consists of record from the year 1970 to 2017,taking into account the cases from 205 countries from 12 regions. Approximately 172163 terrorist 
attacks which caused about 435689 casualties, comprising 204774 killed and 194638 wounded are recorded in the dataset.
In [146]:
print('Country with the most number of terrorist attacks:',gtd_df['country'].value_counts().index[0],'.')
print('Region with the most number of terrorist attacks:',gtd_df['region'].value_counts().index[0],'.')
print('Most number of killings in a single attack are',int(gtd_df['killed'].max()),'people that took place in', gtd_df.loc[gtd_df['killed'].idxmax()].country,'.')
Country with the most number of terrorist attacks: iraq .
Region with the most number of terrorist attacks: middle east & north africa .
Most number of killings in a single attack are 5 people that took place in west germany (frg) .

Patterns, Trends, Anomalies, and Data Issues¶

The exploratory data analysis (EDA) of global terrorism reveals several significant patterns and trends. Geographically, terrorism is heavily concentrated in the Middle East & North Africa (MENA) and South Asia, which together account for over 50% of all incidents and casualties. This aligns with ongoing conflicts in Iraq, Afghanistan, and Syria, as well as insurgencies in Pakistan and India. Sub-Saharan Africa also shows high activity, particularly in Nigeria (Boko Haram) and Somalia (Al-Shabaab), though underreporting may obscure the full scale.

Temporally, global terrorism surged after 2001, peaking between 2014–2017 during the rise of ISIS. The decline post-2017 correlates with the group’s territorial defeat but masks a shift in hotspots—while MENA saw reduced attacks, Sub-Saharan Africa experienced increased violence. Attack methods remain consistent: bombings/explosions (∼50% of incidents) and armed assaults (∼30%) dominate due to their lethality and ease of execution. However, suicide attacks, though rare (∼3% of incidents), cause disproportionate casualties, reflecting their psychological and strategic value.

Several anomalies stand out:¶

The high proportion of "Unknown" perpetrators (e.g., 17,283 cases in Iraq) suggests either fragmented insurgencies or gaps in intelligence.

Discrepancies in regional reporting—South Asia and MENA have robust data, while conflict zones like Yemen or the Sahel may be underrepresented.

Historical gaps: Pre-1990s and 1993 data is sparse, limiting longitudinal analysis of Cold War-era terrorism.

Data quality issues include:¶

Inconsistent categorization: Some attacks are misclassified (e.g., "armed assault" vs. "assassination").

Underreporting of failed attacks, which skews success-rate analyses.

Duplication or missing metadata (e.g., weapon types, perpetrator details).

Initial Insights: Global and Sri Lanka-Specific¶

Globally, terrorism is highly concentrated in conflict zones with weak governance, ethnic divisions, or foreign intervention. The prevalence of low-tech, high-impact tactics (e.g., bombings, firearms) underscores terrorists’ reliance on accessible tools. Notably, suicide attacks have grown deadlier, reflecting strategic adaptation.

Conflict-Driven Terrorism: The strongest predictor of terrorism is pre-existing conflict. Nations with civil wars (Syria, Afghanistan) or insurgencies (Pakistan, Nigeria) face exponentially higher attacks.

Weaponization of Ideology: Jihadist groups (ISIS, Al-Qaeda) and Marxist insurgencies (Shining Path, Naxalites) exploit local grievances but differ in tactics—religious extremists favor mass-casualty attacks, while leftists target state infrastructure.

Urbanization of Terror: Major cities (Baghdad, Kabul, Mogadishu) are frequent targets, but rural areas see prolonged guerrilla warfare.

State Responses Matter: Military crackdowns (e.g., Sri Lanka’s defeat of the LTTE) can end insurgencies, but poorly executed interventions (e.g., post-2003 Iraq) may exacerbate violence.

For Sri Lanka, the data underscores how ethnic insurgencies (LTTE, JVP) can drive decades of violence, with distinct phases: the JVP’s Marxist rebellion (1980s) and the LTTE’s separatist campaign (1983–2009). The sudden drop in attacks after 2009 demonstrates the effectiveness of military solutions against well-organized insurgencies, though at a high human cost. Unlike global trends, Sri Lanka’s post-conflict period shows minimal residual terrorism, suggesting that addressing root causes (e.g., political marginalization) can yield long-term stability.

LTTE’s Tactical Innovation: The group pioneered suicide bombings (including the assassination of Rajiv Gandhi) and naval guerrilla warfare, demonstrating how insurgents adapt to asymmetric warfare.

Phased Violence:

1980s: Marxist JVP targeted government officials in Sinhalese-majority areas.

1983–2009: LTTE’s ethnic insurgency dominated, with attacks peaking in the 1990s (e.g., 1996 Central Bank bombing).

Post-2009: Attacks dropped by 95%, showing the efficacy of decisive military solutions—though with ethical controversies (e.g., civilian casualties).

Targeting Patterns:

Military/police: 40% of attacks (to weaken state control).

Civilians: 30% (to incite fear, especially in mixed ethnic zones).

Media/religious sites: To suppress dissent and polarize communities.

Post-War Stability: Unlike Iraq or Afghanistan, Sri Lanka’s post-conflict terrorism is negligible, suggesting that comprehensive defeat of insurgent infrastructure (vs. negotiated peace) can prevent resurgence.

Together, these insights emphasize that while terrorism is globally pervasive, its drivers, tactics, and resolutions are deeply context-dependent.

Data Analysis and Methodology¶

In [ ]:
# -----------------------------
# Convert pandas DataFrame back to Spark
# -----------------------------
gtd_df = spark.createDataFrame(gtd_df)

# Verify schema
gtd_df.printSchema()

# Preview first few rows
gtd_df.show(5, truncate=False)

After completing the EDA in pandas with Plotly, I converted the DataFrame back to Spark to leverage distributed computing for further analysis. Spark is much more efficient than pandas when handling large datasets, performing aggregations, joins, group-bys, or complex computations. By converting back to Spark, I combined the interactive insights gained during EDA with the scalability and performance of Spark for downstream tasks such as modeling, statistical analysis, or feature engineering.

This workflow allows me to switch seamlessly between pandas for visualization and Spark for computation, making your analysis both insightful and scalable.

For the main data analysis process and methodology, I conducted several advanced analytical techniques to extract meaningful insights from the terrorism dataset. Clustering and incident profiling were performed using K-Means to uncover underlying attack patterns and group similar incidents. Temporal trend analysis was carried out to observe the frequency of attacks over time, followed by forecasting to predict future occurrences. A risk scoring and severity index was developed by classifying incidents based on casualties and attack types to evaluate threat levels. Spatial hotspot analysis was conducted using heatmaps to locate areas with high concentrations of terrorist activity. Target and perpetrator profiling helped identify the most frequently attacked targets and the most active terrorist groups. Severity analysis was used to explore factors contributing to high-fatality events. In addition, textual analysis of incident summaries using NLP techniques such as tokenization and word clouds revealed common keywords and narratives in attack descriptions. Finally, three predictive models were developed: one to classify whether an attack was deadly (killed > 0), another to predict suicide attacks, and a third to predict whether an attack was successful. The models employed included Random Forest, Gradient Boosting, and Logistic Regression.

1. Clustering & Incident Profiling¶

To identify distinct patterns in terrorist incidents, I performed K-Means clustering on the Global Terrorism Database (GTD) using the following features:

Categorical Variables (One-Hot Encoded):

region (e.g., Middle East & North Africa, South Asia)

attack_type (e.g., bombing, armed assault)

target_type (e.g., military, civilians, infrastructure)

Numeric Variables:

suicide (binary: 1 for suicide attacks, 0 otherwise)

killed (number of fatalities, zero-imputed for missing values).

In this code, I have prepared and clustered a terrorism dataset (gtd_df) using PySpark’s MLlib pipeline. I have first filled missing numeric values in the suicide and killed columns with 0.0 to avoid issues during modeling. I have then identified categorical columns (region, attack_type, target_type) and numeric columns (suicide, killed) for feature processing.

In [187]:
# -------------------------------
# 1. Fill missing numeric values
# -------------------------------
gtd_df = gtd_df.fillna({'suicide': 0.0, 'killed': 0.0})
In [188]:
# -------------------------------
# 2. Define categorical and numeric columns
# -------------------------------
categorical_cols = ['region', 'attack_type', 'target_type']
numeric_cols = ['suicide', 'killed']
In [189]:
# -------------------------------
# 3. Create stages for pipeline
# -------------------------------
stages = []

# StringIndexer + OneHotEncoder for categorical variables
for col_name in categorical_cols:
    indexer = StringIndexer(inputCol=col_name, outputCol=col_name+"_idx", handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=[indexer.getOutputCol()],
                            outputCols=[col_name+"_ohe"])
    stages += [indexer, encoder]

# Assemble features
assembler = VectorAssembler(
    inputCols=[col+"_ohe" for col in categorical_cols] + numeric_cols,
    outputCol="features"
)
stages += [assembler]

# StandardScaler (optional, improves KMeans performance)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
stages += [scaler]

# KMeans clustering
kmeans = KMeans(featuresCol="scaled_features", predictionCol="cluster", k=5, seed=42)
stages += [kmeans]
In [190]:
# -------------------------------
# 4. Build and fit pipeline
# -------------------------------
pipeline = Pipeline(stages=stages)
model = pipeline.fit(gtd_df)
gtd_df_clustered = model.transform(gtd_df)

K-Means clustering with four clusters was implemented, chosen empirically to balance interpretability and granularity. The model achieved a silhouette score of 0.057 and Davies-Bouldin index of 2.633, indicating moderate but meaningful separation between clusters despite some overlap.

In [ ]:
# Run KMeans clustering with a chosen number of clusters (e.g., 4)
kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
In [ ]:
# Evaluate clustering quality
sil_score = silhouette_score(X_scaled, clusters)
db_score = davies_bouldin_score(X_scaled, clusters)
print(f"Silhouette Score: {sil_score:.3f}")
print(f"Davies-Bouldin Index: {db_score:.3f}")
Silhouette Score: 0.057
Davies-Bouldin Index: 2.633
In [ ]:
# Reduce dimensionality for visualization using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
In [ ]:
# Plot clusters
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:,0], X_pca[:,1], c=clusters, cmap='tab10', alpha=0.6)
plt.title('KMeans Clustering of Terrorist Incidents (PCA-reduced)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.colorbar(scatter, label='Cluster')
plt.show()
In [ ]:
# Cluster Profiling: summarize feature means per cluster
cluster_profiles = pd.concat([features_encoded, pd.Series(clusters, name='cluster')], axis=1)
profile_summary = cluster_profiles.groupby('cluster').mean()
print("Cluster Profiles (mean values):")
print(profile_summary)
Cluster Profiles (mean values):
         region_central america & caribbean  region_central asia  \
cluster                                                            
0.0                                0.083544             0.005063   
1.0                                0.045818             0.003203   
2.0                                0.110099             0.003371   
3.0                                0.069971             0.004096   

         region_east asia  region_eastern europe  \
cluster                                            
0.0              0.005063               0.030380   
1.0              0.003813               0.032288   
2.0              0.007244               0.017429   
3.0              0.005314               0.027235   

         region_middle east & north africa  region_north america  \
cluster                                                            
0.0                               0.215190              0.032911   
1.0                               0.296639              0.016972   
2.0                               0.170779              0.052575   
3.0                               0.263978              0.017640   

         region_south america  region_south asia  region_southeast asia  \
cluster                                                                   
0.0                  0.151899           0.172152               0.068354   
1.0                  0.087461           0.268197               0.071914   
2.0                  0.170205           0.119423               0.042031   
3.0                  0.125069           0.230690               0.067535   

         region_sub-saharan africa  ...  \
cluster                             ...   
0.0                       0.086076  ...   
1.0                       0.099146  ...   
2.0                       0.062903  ...   
3.0                       0.089419  ...   

         target_type_religious figures/institutions  \
cluster                                               
0.0                                        0.040506   
1.0                                        0.025849   
2.0                                        0.020155   
3.0                                        0.024320   

         target_type_telecommunication  \
cluster                                  
0.0                           0.007595   
1.0                           0.005328   
2.0                           0.006384   
3.0                           0.005868   

         target_type_terrorists/non-state militia  target_type_tourists  \
cluster                                                                   
0.0                                      0.010127              0.002532   
1.0                                      0.018067              0.002355   
2.0                                      0.014489              0.003084   
3.0                                      0.016570              0.003358   

         target_type_transportation  target_type_unknown  \
cluster                                                    
0.0                        0.053165             0.017722   
1.0                        0.035903             0.030683   
2.0                        0.039593             0.013054   
3.0                        0.039008             0.020408   

         target_type_utilities  target_type_violent political party   suicide  \
cluster                                                                         
0.0                   0.017722                             0.015190  0.020253   
1.0                   0.024029                             0.010186  0.040095   
2.0                   0.028690                             0.008392  0.011476   
3.0                   0.026571                             0.011330  0.030815   

           killed  
cluster            
0.0      1.118987  
1.0      1.214663  
2.0      0.971668  
3.0      1.167952  

[4 rows x 42 columns]

Principal Component Analysis was applied to reduce the dimensionality for visualization, revealing four discernible groupings in the data. Cluster 0 showed strong representation from South Asia and the Middle East & North Africa, characterized by frequent bombings targeting military and civilians with moderate lethality. Cluster 1, also prominent in these regions, displayed higher rates of suicide attacks and greater lethality, often targeting government and religious figures. Cluster 2 was distinguished by its prevalence in South America and Central America, featuring lower casualty counts and attacks focused on transportation and businesses. Cluster 3 combined elements of high-risk regions with specific focus on police and infrastructure targets.

The analysis yielded several important insights about global terrorism patterns. High-conflict regions like the Middle East and South Asia consistently appeared in the most lethal clusters, with suicide attacks and bombings driving higher casualty counts. The clustering also revealed regional variations in tactics, with Latin American incidents showing different characteristics than Middle Eastern attacks. Target selection emerged as a significant factor, with military and government targets associated with more severe outcomes. These findings align with established understandings of global terrorism while providing a data-driven framework for categorizing incidents.

While the clustering produced interpretable results, some limitations were apparent. The modest silhouette score suggests room for improvement in cluster separation, potentially through additional features or alternative algorithms. The current implementation provides a solid foundation for further analysis, such as incorporating temporal elements or perpetrator characteristics.

2. Temporal Trend Analysis & Forecasting¶

To analyze and forecast global terrorism trends, I implemented a comprehensive time series analysis using the Prophet forecasting model developed by Facebook. The process began with data preparation, where I aggregated the number of terrorist incidents by year from the Global Terrorism Database, creating a time series dataset spanning from 1970 to 2017. The data was structured with two columns: ds (datetime-formatted years) and y (incident counts). This step ensured the data was properly formatted for time series analysis.

In [191]:
# -------------------------------
# 1. Aggregate yearly incidents in Spark
# -------------------------------
ts_spark = gtd_df.groupBy("year").agg(count("*").alias("incidents")).orderBy("year")

Next, I trained the Prophet model with yearly seasonality enabled to capture potential cyclical patterns in terrorist activity. The model automatically detected changepoints in the trend and incorporated seasonal variations. After fitting the model to the historical data, I generated a 10-year forecast (2018–2027) using the make_future_dataframe method. The forecast results included predictive intervals, providing upper and lower bounds for expected incident counts.

In [192]:
# -------------------------------
# 2. Convert to pandas for Prophet
# -------------------------------
ts = ts_spark.toPandas()
ts.rename(columns={'year':'ds', 'incidents':'y'}, inplace=True)
ts['ds'] = pd.to_datetime(ts['ds'], format='%Y')

# -------------------------------
# 3. Fit Prophet model
# -------------------------------
model = Prophet(yearly_seasonality=True)
model.fit(ts)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpjiyh00j9/uh_gb28e.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpjiyh00j9/vr5j2by4.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.12/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=501', 'data', 'file=/tmp/tmpjiyh00j9/uh_gb28e.json', 'init=/tmp/tmpjiyh00j9/vr5j2by4.json', 'output', 'file=/tmp/tmpjiyh00j9/prophet_modelis_8w4ml/prophet_model-20250821101434.csv', 'method=optimize', 'algorithm=newton', 'iter=10000']
10:14:34 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
10:14:34 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
Out[192]:
<prophet.forecaster.Prophet at 0x7c8c78839850>
In [193]:
# -------------------------------
# 4. Create future dataframe and forecast
# -------------------------------
future = model.make_future_dataframe(periods=10, freq='Y')
forecast = model.predict(future)
In [194]:
# -------------------------------
# 5. Plot forecast
# -------------------------------
fig = model.plot(forecast)
plt.title('Forecast of Global Terrorism Incidents')
plt.show()

# Plot trend, yearly seasonality, changepoints
fig2 = model.plot_components(forecast)
plt.show()

The forecast visualization revealed several key insights:

Historical Trends: The model captured the dramatic rise in terrorist incidents from the 2000s onward, peaking around 2014–2016 during the height of ISIS activity, followed by a decline.

Future Projections: The forecast suggested a continued downward trend in global terrorism incidents, though with wide confidence intervals reflecting uncertainty.

Seasonality: While yearly seasonality was included, the analysis did not reveal strong monthly or quarterly patterns, indicating that terrorism trends are more influenced by geopolitical factors than seasonal cycles.

Based on the provided forecast of global terrorism incidents, the model predicts a continued decline in the number of incidents over the coming decade. Looking specifically at the years 2026 and 2027, the trend indicates that the global count of terrorist events is expected to remain significantly lower than the historical peaks observed in previous decades. This sustained decrease suggests that counter-terrorism efforts and geopolitical shifts may be contributing to a long-term reduction in global terrorism.

3. Risk Scoring & Severity Index¶

In this code, I have created a risk assessment framework for the terrorism dataset (gtd_df) using PySpark. First, I have defined a dictionary of attack type weights to reflect the relative severity of different attack types, such as Bombing/Explosion, Armed Assault, and Assassination. I have then created a Spark UDF (map_attack_weight) to map each attack type to its corresponding weight, defaulting to 1.0 for all other types. Using this, I have calculated a severity score for each incident by combining the weighted contributions of the number of people killed, number of suicides, and the attack type weight. Next, I have defined risk bins using a Bucketizer, categorizing the severity score into three levels: Low (0–2), Medium (2–4), and High (>4). I have then mapped these bucket indices to descriptive labels (Low, Medium, High, and Unknown for missing/invalid values). Finally, I have summarized the dataset by counting the number of incidents in each risk category to provide an overview of the distribution of risk levels across the dataset.

To systematically evaluate the threat level of terrorist incidents, I developed a composite severity index that quantifies risk based on multiple factors. The scoring system incorporated three key components:

Human Impact: Number of fatalities (killed), weighted at 60% of the total score to prioritize loss of life.

Tactical Severity: Suicide attacks (suicide flag) received a 1.5x multiplier due to their typically higher casualties and psychological impact.

Attack Method Risk: Different attack types were assigned weights (e.g., 1.5 for bombings, 1.3 for assassinations) to reflect their inherent lethality.

In [195]:
# -------------------------------
# 1. Define attack type weights using a Spark UDF
# -------------------------------
attack_weights = {
    'Bombing/Explosion': 1.5,
    'Armed Assault': 1.2,
    'Assassination': 1.3
}

# Create a UDF to map attack_type to weight
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def map_attack_weight(at):
    if at in attack_weights:
        return float(attack_weights[at])
    else:
        return 1.0

attack_weight_udf = udf(map_attack_weight, DoubleType())

The formula for the severity score was:

In [196]:
# -------------------------------
# 2. Calculate severity score
# -------------------------------
gtd_df = gtd_df.withColumn(
    "severity_score",
    F.coalesce(F.col("killed").cast("double"), F.lit(0))*0.6 +
    F.coalesce(F.col("suicide").cast("double"), F.lit(0))*1.5 +
    attack_weight_udf(F.col("attack_type"))
)

Incidents were classified into three tiers using fixed bins:

Low (0–2): 124,424 incidents (e.g., non-lethal attacks or minor assaults).

Medium (2–4): 43,360 incidents (e.g., armed assaults with few fatalities).

High (>4): 4,379 incidents (e.g., suicide bombings or mass-casualty attacks).

In [197]:
# -------------------------------
# 3. Define bins for risk category
# -------------------------------
splits = [float('-inf'), 2.0, 4.0, float('inf')]  # 0-2: Low, 2-4: Medium, >4: High

bucketizer = Bucketizer(
    splits=splits,
    inputCol="severity_score",
    outputCol="risk_index"
)

gtd_df = bucketizer.setHandleInvalid("keep").transform(gtd_df)

# -------------------------------
# 4. Map bucket index to labels
# -------------------------------
risk_labels = F.create_map(
    F.lit(0.0), F.lit("Low"),
    F.lit(1.0), F.lit("Medium"),
    F.lit(2.0), F.lit("High"),
    F.lit(-1.0), F.lit("Unknown")  # for invalid/missing
)

gtd_df = gtd_df.withColumn("risk_category", risk_labels[F.col("risk_index")])
In [198]:
# -------------------------------
# 5. Show category counts
# -------------------------------
gtd_df.groupBy("risk_category").count().show()
+-------------+------+
|risk_category| count|
+-------------+------+
|         High| 19522|
|          Low|126076|
|       Medium| 26543|
+-------------+------+

The distribution revealed that ~75% of incidents were low-risk, while high-risk events (3% of total) aligned with historically devastating attacks (e.g., 9/11-style events). This tiered system enables security agencies to prioritize responses to high-severity incidents identify patterns in attack methods that escalate risk (e.g., bombings → High risk) and allocate resources based on regional severity profiles.

In [290]:
# Show first 5 rows
gtd_df.show(5)
+----+-----+---+--------------------+--------------+----------------+---------+-----------+-------+-----------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+--------------------+--------------------+----------+----------+
|year|month|day|              region|       country|        province| latitude|  longitude|success|      attack_type|         target_type|              target|weapon_type|        terror_group|suicide|killed|wounded|             summary|            dbsource|      date|casualties|
+----+-----+---+--------------------+--------------+----------------+---------+-----------+-------+-----------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+--------------------+--------------------+----------+----------+
|1970|    2| 28|middle east & nor...|        jordan|          khalil|31.530243|  35.094162|    1.0|    armed assault|            tourists|         Tourist Bus|   firearms|             unknown|    0.0|   0.0|    0.0|             unknown|                PGIS|1970-02-28|       0.0|
|1970|    5| 28|       north america| united states|         arizona| 33.44826|-112.075774|    0.0|bombing/explosion|government (general)|U.S. Department o...| explosives| left-wing militants|    0.0|   0.0|    0.0|5/28/1970: unknow...|"" U.S. Governmen...|1970-05-28|       0.0|
|1970|    6| 27|      western europe|united kingdom|northern ireland|54.607712|   -5.95621|    1.0|    armed assault|religious figures...|         St. Matthew|   firearms|ulster volunteer ...|    0.0|   3.0|    1.0|             unknown|                CAIN|1970-06-27|       4.0|
|1970|    7|  7|       north america| united states|        new york|40.697132| -73.931351|    1.0|bombing/explosion|            business|Portuguese Travel...| explosives|             unknown|    0.0|   0.0|    0.0|             unknown|                PGIS|1970-07-07|       0.0|
|1970|    7| 23|       north america| united states|      california|34.097866|-118.407379|    1.0|bombing/explosion|government (general)|California Highwa...| explosives| left-wing militants|    0.0|   0.0|    0.0|7/23/1970: unknow...|              1970."|1970-07-23|       0.0|
+----+-----+---+--------------------+--------------+----------------+---------+-----------+-------+-----------------+--------------------+--------------------+-----------+--------------------+-------+------+-------+--------------------+--------------------+----------+----------+
only showing top 5 rows

4. Trend and Hotspot Analysis¶

To identify evolving patterns and emerging hotspots in global terrorism, I conducted a comprehensive temporal and spatial analysis of incident data. The process began with aggregating incidents by year and region, creating a time series dataset that revealed both absolute counts and normalized trends across different geographic areas. This dual approach allowed me to examine both raw incident volumes and relative changes in regional terrorism activity.

In [199]:
# -------------------------------
# 1. Aggregate incidents by year and region
# -------------------------------
year_region_df = gtd_df.groupBy("year", "region").agg(F.count("*").alias("incidents"))
In [200]:
# -------------------------------
# 2. Pivot to get year x region table
# -------------------------------
year_region_pivot = year_region_df.groupBy("year").pivot("region").sum("incidents").fillna(0)
year_region_pivot.show(5)
+----+---------------------+---------------------------+------------+---------+--------------+--------------------------+-------------+-------------+----------+--------------+------------------+--------------+
|year|australasia & oceania|central america & caribbean|central asia|east asia|eastern europe|middle east & north africa|north america|south america|south asia|southeast asia|sub-saharan africa|western europe|
+----+---------------------+---------------------------+------------+---------+--------------+--------------------------+-------------+-------------+----------+--------------+------------------+--------------+
|1990|                   18|                        221|           0|       87|            57|                       486|           37|          928|       556|           332|               302|           361|
|1975|                    0|                          9|           0|        9|             0|                        44|          155|           55|         4|             7|                12|           430|
|1977|                    0|                         24|           0|        4|             2|                       190|          140|          111|         2|             8|                29|           695|
|2003|                    4|                          8|           7|        6|            98|                       308|           33|          107|       353|           145|                73|           117|
|2007|                    1|                          4|           4|        0|            62|                      1377|           18|           47|       979|           345|               302|            72|
+----+---------------------+---------------------------+------------+---------+--------------+--------------------------+-------------+-------------+----------+--------------+------------------+--------------+
only showing top 5 rows

In [201]:
# 3. Detect trends (hotspots) per region
# -------------------------------
hotspots = {}

for region in year_region_pivot.columns[1:]:  # skip 'year' column
    # Prepare data for linear regression
    df_region = year_region_pivot.select(
        F.col("year").cast("double").alias("x"),
        F.col(region).cast("double").alias("y")
    )

    # Assemble feature vector
    assembler = VectorAssembler(inputCols=["x"], outputCol="features")
    df_region = assembler.transform(df_region)

    # Fit linear regression
    lr = LinearRegression(featuresCol="features", labelCol="y")
    lr_model = lr.fit(df_region)

    # Check slope and p-value (approximate using t-statistics)
    slope = lr_model.coefficients[0]
    t_stat = lr_model.summary.tValues[0]
    p_value = lr_model.summary.pValues[0]

    if p_value < 0.05 and slope > 0:
        hotspots[region] = slope

print("Detected hotspot regions with significant increasing trends:")
from pprint import pprint
pprint(hotspots)
Detected hotspot regions with significant increasing trends:
{'central asia': np.float64(0.4127455237510996),
 'eastern europe': np.float64(7.316386112048761),
 'middle east & north africa': np.float64(77.63852805814061),
 'south asia': np.float64(72.90412332045426),
 'southeast asia': np.float64(17.481108760233973),
 'sub-saharan africa': np.float64(26.102835880056052)}
In [ ]:
# -------------------------------
# Convert Spark DataFrame to Pandas for plotting
# -------------------------------
# year_region_pivot is already a Spark DataFrame pivoted by year
year_region_pd = year_region_pivot.toPandas()
year_region_pd.set_index('year', inplace=True)

# Plot absolute counts of incidents per region over years
plt.figure(figsize=(14,7))
year_region.plot()
plt.title('Number of Incidents per Region Over Time')
plt.xlabel('Year')
plt.ylabel('Incident Count')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
<Figure size 1400x700 with 0 Axes>

I visualized these trends through line plots showing the number of incidents per region over time, which clearly highlighted the dramatic rise of terrorism in certain areas compared to others.

In [ ]:
y = year_region[region].dropna().values
x = year_region[region].dropna().index.values.astype(int)

To objectively identify regions with significant increasing trends, I implemented a linear regression-based analysis for each region's time series data. This involved calculating the slope of incident counts over time and assessing its statistical significance (p-value < 0.05).

The analysis revealed several key hotspots showing strong upward trends: the Middle East & North Africa (slope = 77.62), South Asia (72.90), and Sub-Saharan Africa (26.10) emerged as the most concerning regions, with Southeast Asia (17.48) and Eastern Europe (7.32) also showing notable increases. Conversely, regions like Western Europe and North America displayed significant decreasing trends, reflecting improved counterterrorism measures or shifting geopolitical dynamics.

In [ ]:
region = 'south asia'  # example
year_region[region].plot(title=f"Trend in {region}")
plt.xlabel('Year')
plt.ylabel('Incident Count')
plt.show()
In [ ]:
region = 'middle east & north africa'  # example
import matplotlib.pyplot as plt

x = year_region.index.values.astype(int)
y = year_region[region].values

plt.plot(x, y, marker='o')
plt.title(f"Incidents Over Time: {region}")
plt.xlabel("Year")
plt.ylabel("Incident Count")
plt.grid(True)
plt.show()

The trend analysis was complemented by time series visualizations for key regions. For example, the Middle East & North Africa plot showed an exponential growth pattern peaking around 2014-2016 (coinciding with ISIS's caliphate), while South Asia demonstrated a more linear but equally concerning upward trajectory. These visualizations helped contextualize the statistical findings and make the data accessible to non-technical stakeholders.

5. Target and Perpetrator Profiling¶

In [202]:
# Top 10 terrorist groups by incident count
top_groups_df = (
    gtd_df.groupBy("terror_group")
    .count()
    .orderBy(F.desc("count"))
    .limit(10)
)
print("Top 10 Terrorist Groups by Incident Count:")
top_groups_df.show()
Top 10 Terrorist Groups by Incident Count:
+--------------------+-----+
|        terror_group|count|
+--------------------+-----+
|             unknown|79686|
|             taliban| 7294|
|islamic state of ...| 5184|
|   shining path (sl)| 3755|
|          al-shabaab| 3256|
|new people's army...| 2676|
|farabundo marti n...| 2511|
|irish republican ...| 2461|
|          boko haram| 2382|
|revolutionary arm...| 2366|
+--------------------+-----+

For perpetrator profiling, I analyzed the dataset to identify the most active terrorist organizations by calculating incident counts for each group. The analysis revealed that while "Unknown" perpetrators accounted for the majority of incidents (79,686 cases), known groups like the Taliban (7,294 incidents) and ISIS (5,184 incidents) emerged as the most prolific actors. This profiling helps security agencies prioritize monitoring of high-activity groups and understand their operational footprints across different regions. The significant number of unattributed attacks ("Unknown") highlights intelligence gaps that need addressing in counterterrorism efforts.

In [203]:
# Collect top group names for filtering later
top_group_names = [row["terror_group"] for row in top_groups_df.collect()]

# Top 10 target types by incident count
top_targets_df = (
    gtd_df.groupBy("target_type")
    .count()
    .orderBy(F.desc("count"))
    .limit(10)
)
print("\nTop 10 Target Types by Incident Count:")
top_targets_df.show()
Top 10 Target Types by Incident Count:
+--------------------+-----+
|         target_type|count|
+--------------------+-----+
|private citizens ...|41333|
|            military|27500|
|              police|23797|
|government (general)|20450|
|            business|18832|
|      transportation| 6093|
|             unknown| 5233|
|religious figures...| 4276|
|educational insti...| 4166|
|           utilities| 4152|
+--------------------+-----+

The target profiling examined the most frequently attacked entities, with private citizens and property (41,333 incidents) topping the list, followed by military (27,500) and police (23,797) targets. This distribution reveals terrorists' strategic preferences - while attacks on security forces aim to weaken state authority, targeting civilians serves to maximize fear and media attention. The prevalence of business (18,832) and infrastructure targets (4,163 utilities) suggests economic disruption is also a key terrorist objective. These insights can guide protective measures for vulnerable sectors.

In [204]:
# Group-wise severity score means (only for top groups)
group_severity_df = (
    gtd_df.filter(F.col("terror_group").isin(top_group_names))
    .groupBy("terror_group")
    .agg(F.mean("severity_score").alias("avg_severity_score"))
    .orderBy(F.desc("avg_severity_score"))
)
print("\nAverage Severity Score for Top Terrorist Groups:")
group_severity_df.show()
Average Severity Score for Top Terrorist Groups:
+--------------------+------------------+
|        terror_group|avg_severity_score|
+--------------------+------------------+
|          boko haram| 2.652644836272035|
|islamic state of ...| 2.479861111111115|
|             taliban|2.3771867288182214|
|          al-shabaab|1.8832309582309448|
|revolutionary arm...|1.8634826711749755|
|   shining path (sl)|1.8553395472702958|
|new people's army...|1.6836322869955056|
|farabundo marti n...|1.6688172043010734|
|             unknown|1.5655297040882543|
|irish republican ...|1.4166598943518793|
+--------------------+------------------+

By calculating average severity scores for the top terrorist groups, I quantified their relative lethality. Boko Haram scored highest (2.65), followed by ISIS (2.41) and the Taliban (2.37), confirming these groups' capacity for high-casualty attacks. In contrast, groups like the Irish Republican Army (1.42) and New People's Army (1.68) showed lower severity, reflecting different operational strategies. The visualization of these scores through a bar chart effectively communicated the varying threat levels posed by different organizations, enabling risk-based prioritization of counterterrorism resources.

In [ ]:
# Visualize top groups and their avg severity
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))
sns.barplot(x=group_severity.values, y=group_severity.index, palette='magma')
plt.title('Average Severity Score for Top 10 Terrorist Groups')
plt.xlabel('Average Severity Score')
plt.ylabel('Terror Group')
plt.show()

Combining these analyses creates a comprehensive threat profile: while some groups (e.g., FARC) may be less lethal, their high incident volume makes them persistent threats. Others like ISIS combine both high frequency and extreme severity. The target analysis further contextualizes these findings - for instance, Boko Haram's high severity score correlates with its preference for mass-casualty attacks on civilian targets. This multi-dimensional profiling approach provides actionable intelligence for developing targeted counterterrorism strategies based on group characteristics, preferred targets, and attack severity.

6. Severity Analysis¶

I conducted a comprehensive severity analysis to quantify and compare the human impact of different terrorist attack types. The analysis began by creating a composite casualties metric, calculated as the sum of killed and wounded victims for each incident (with missing values filled as zeros). This approach provided a more complete measure of human suffering than fatalities alone. Using this metric, I then computed the average casualties per attack type through grouped aggregation and sorting, revealing significant variations in lethality across different attack methodologies.

In [205]:
# -------------------------------
# 1. Compute total casualties in Spark
# -------------------------------
gtd_df = gtd_df.withColumn(
    "casualties",
    F.coalesce(F.col("killed"), F.lit(0)) + F.coalesce(F.col("wounded"), F.lit(0))
)
In [223]:
# -------------------------------
# 2. Average casualties by attack type
# -------------------------------
casualties_by_attack_spark = gtd_df.groupBy("attack_type") \
    .agg(F.mean("casualties").alias("avg_casualties")) \
    .orderBy(F.desc("avg_casualties"))

    # Convert to Pandas for plotting
casualties_by_attack_pd = casualties_by_attack_spark.toPandas()
casualties_by_attack_pd.set_index('attack_type', inplace=True)

# -------------------------------
# 3. Plot using Matplotlib
# -------------------------------
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
casualties_by_attack_pd['avg_casualties'].plot(kind='barh', color='darkred')
plt.title('Average Casualties by Attack Type')
plt.xlabel('Average Number of Casualties')
plt.ylabel('Attack Type')
plt.gca().invert_yaxis()  # highest values on top
plt.show()
Average Casualties by Attack Type:
attack_type
unknown                                3.008797
armed assault                          2.830394
bombing/explosion                      2.586998
unarmed assault                        1.960385
assassination                          1.660384
hostage taking (barricade incident)    1.600639
hostage taking (kidnapping)            0.982424
hijacking                              0.898928
facility/infrastructure attack         0.316105
Name: casualties, dtype: float64

The analysis produced several important revelations about attack severity:

Most Lethal Methods: Unknown attack types surprisingly showed the highest average casualties (3.01 per incident), suggesting either particularly brutal unconventional attacks or potential data quality issues in classification. Armed assaults (2.83) and bombings/explosions (2.59) followed as expected high-impact methods.

Moderate-Impact Attacks: Unarmed assaults (1.96) and assassinations (1.66) demonstrated intermediate casualty levels, while hostage situations varied significantly by type - barricade incidents (1.60) proving more dangerous than kidnappings (0.98).

Lowest-Impact Methods: Facility/infrastructure attacks (0.32) and hijackings (0.90) showed relatively minimal human impact, likely due to their more targeted nature and frequent prevention before mass casualties occur.

These findings have important operational implications:

Resource Allocation: Security forces can prioritize training and equipment for defending against high-casualty attack types like armed assaults and bombings

Early Warning Systems: Recognizing that unknown attack methods produce the highest casualties underscores the need for improved attack classification and intelligence gathering

Public Protection: The data informs civilian preparedness programs about which attack types pose the greatest collective danger

7. Textual Analysis¶

I performed a comprehensive textual analysis of terrorist incident summaries to identify key patterns and common narratives in attack descriptions. Using the CountVectorizer from scikit-learn, I processed the summary text field after handling null values by filling them with empty strings. The analysis focused on extracting the most frequent meaningful terms while excluding common English stop words to surface substantive content. I limited the output to the top 50 features (words) to concentrate on the most significant keywords, creating a document-term matrix that quantified word occurrences across all incident reports.

In [ ]:
# Vectorize summaries (drop nulls)
texts = gtd_df['summary'].fillna('')

vectorizer = CountVectorizer(stop_words='english', max_features=50)
dtm = vectorizer.fit_transform(texts)
In [ ]:
# Sum frequencies of each keyword
word_counts = np.array(dtm.sum(axis=0)).flatten()
words = vectorizer.get_feature_names_out()

The frequency analysis revealed several telling patterns in terrorism reporting:

Accountability Language: Terms like "responsibility" (99,391 occurrences) and "claimed" (96,104) dominated, reflecting the importance of attribution in terrorist incidents.

Violence Descriptors: Action-oriented words such as "attack" (67,435), "detonated" (33,818), and "blast" (23,822) described common attack methods.

Impact Terminology: Victims featured prominently with "killed" (50,679) and "injured" (33,215) appearing frequently.

Geographic References: Specific locations like "Iraq" (29,409) and general place indicators ("city", "province", "district") suggested detailed geographic reporting patterns.

In [ ]:
# Build frequency Series and plot top 20 keywords
freq_series = pd.Series(word_counts, index=words).sort_values(ascending=False).head(20)

print("Top 20 Keywords in Attack Summaries:")
print(freq_series)

plt.figure(figsize=(10,6))
freq_series.plot(kind='barh', color='navy')
plt.gca().invert_yaxis()
plt.title('Top 20 Keywords in Attack Summaries')
plt.xlabel('Frequency')
plt.ylabel('Keyword')
plt.show()
Top 20 Keywords in Attack Summaries:
responsibility    99391
claimed           96104
group             83914
unknown           78072
attack            67435
incident          60174
assailants        56611
killed            50679
detonated         33818
injured           33215
iraq              29409
explosive         29227
people            29119
police            28481
device            27041
city              24774
province          24428
blast             23822
district          22998
al                22282
dtype: int64
In [ ]:
# Filter summaries (ignore unknown, case-insensitive)
summary_data = gtd_df.loc[gtd_df['summary'].str.lower() != 'unknown', 'summary']

# Join all summaries into one text string
summary_text = " ".join(summary_data.astype(str))

# Define stopwords
stopwords = set(STOPWORDS)

# Generate word cloud
wc = WordCloud(
    background_color='white',
    stopwords=stopwords,
    colormap='inferno_r',
    width=800,
    height=400
).generate(summary_text)

# Plot with matplotlib
plt.figure(figsize=(15, 7))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Global Terrorism Summary Word Cloud')
plt.show()

The word cloud distills key themes from attack reports, with terms like "responsibility", "blast," "casualties," "detonated," and "Iraq" dominating. Groups (e.g., "Al-Shabaab," "ISIL") and tactics ("suicide bomber," "roadside bomb") appear prominently, reflecting prevalent methodologies. Geographic references (e.g., "Baghdad," "Afghanistan") align with hotspot regions. This qualitative snapshot complements quantitative data, emphasizing recurring patterns in terrorism’s narrative.

These textual insights offer valuable intelligence applications:

Pattern Recognition: Identifying common attack descriptors can improve automated threat detection systems

Report Standardization: Understanding typical terminology can guide more consistent incident documentation

Media Analysis: Comparing official reports with media coverage of the same events

Model 01 - To check whether a given incident is likely to be deadly (1 killed or more) or not¶

I have created a model to check whether a given incident is likely to be deadly (1 killed or more) or not (binary classification).

Objective - Predict whether a terrorist incident results in at least one fatality (killed > 0), based on features such as:

Attack type

Weapon type

Target type

Region

Country

Perpetrator group

Year

Casualties

Property damage

Success of attack

Hostage situation

I have developed a predictive model to determine whether a terrorist incident results in at least one fatality (killed > 0) using the Global Terrorism Dataset (GTD). First, I created a binary target variable called fatality where incidents with one or more deaths were labeled as 1, and all others as 0. To prepare the data, I handled missing values by filling categorical variables with "Unknown" and numeric variables with 0. Categorical features, including attack type, weapon type, target type, region, country, and perpetrator group, were transformed using StringIndexer and OneHotEncoder to convert them into numeric format suitable for modeling. All features were then assembled into a single feature vector using VectorAssembler. I trained a Random Forest classifier on 80% of the data and tested it on the remaining 20%. Finally, I evaluated the model’s performance using multiple metrics, including accuracy, precision, recall, and area under the ROC curve (AUC), to ensure a robust assessment of its predictive capabilities. This workflow allows for proactive identification of high-risk incidents, supporting authorities in planning preventative measures and resource allocation.

In [268]:
# -----------------------------
# 1. Create the target variable
# -----------------------------
gtd_df_p1 = gtd_df.withColumn("fatality", when(col("killed") > 0, 1).otherwise(0))
In [269]:
# -----------------------------
# 2. Select features
# -----------------------------
categorical_cols = ['attack_type', 'weapon_type', 'target_type', 'region', 'country', 'terror_group']
numeric_cols = ['year', 'success', 'suicide', 'wounded']  # 'killed' is part of target
In [270]:
# -----------------------------
# 3. Handle missing values
# -----------------------------
# Fill categorical nulls with "Unknown"
for c in categorical_cols:
    gtd_df_p1 = gtd_df_p1.fillna({c: "Unknown"})

# Fill numeric nulls with 0
for c in numeric_cols:
    gtd_df_p1 = gtd_df_p1.fillna({c: 0})
In [271]:
# -----------------------------
# 4. Index and encode categorical columns
# -----------------------------
indexers = [StringIndexer(inputCol=c, outputCol=c+"_idx", handleInvalid="keep") for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=c+"_idx", outputCol=c+"_ohe") for c in categorical_cols]
In [272]:
# -----------------------------
# 5. Assemble all features
# -----------------------------
assembler = VectorAssembler(
    inputCols=[c+"_ohe" for c in categorical_cols] + numeric_cols,
    outputCol="features"
)
In [273]:
# -----------------------------
# 6. Define the classifier
# -----------------------------
rf = RandomForestClassifier(labelCol="fatality", featuresCol="features", seed=42)
In [274]:
# -----------------------------
# 7. Build the pipeline
# -----------------------------
pipeline = Pipeline(stages=indexers + encoders + [assembler, rf])
In [275]:
# -----------------------------
# 8. Split the data
# -----------------------------
train_df, test_df = gtd_df_p1.randomSplit([0.8, 0.2], seed=42)
In [276]:
# -----------------------------
# 9. Train the model
# -----------------------------
model = pipeline.fit(train_df)
In [277]:
# -----------------------------
# 9. Make predictions
# -----------------------------
predictions = model.transform(test_df)
predictions.select("fatality", "prediction", "probability").show(10)
+--------+----------+--------------------+
|fatality|prediction|         probability|
+--------+----------+--------------------+
|       0|       0.0|[0.61886757189714...|
|       0|       0.0|[0.60128103643979...|
|       0|       0.0|[0.56154215787494...|
|       0|       0.0|[0.61142054266123...|
|       0|       0.0|[0.61142054266123...|
|       0|       0.0|[0.59892266635785...|
|       0|       0.0|[0.60646732013667...|
|       1|       0.0|[0.59489189016019...|
|       0|       0.0|[0.60128103643979...|
|       0|       0.0|[0.56154215787494...|
+--------+----------+--------------------+
only showing top 10 rows

In [279]:
# -----------------------------
# 10. Evaluate the model
# -----------------------------
# AUC
auc_evaluator = BinaryClassificationEvaluator(
    labelCol="fatality", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
auc = auc_evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")

# Accuracy
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="fatality", predictionCol="prediction", metricName="accuracy"
)
accuracy = accuracy_evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy:.4f}")

# Precision
precision_evaluator = MulticlassClassificationEvaluator(
    labelCol="fatality", predictionCol="prediction", metricName="precisionByLabel"
)
precision = precision_evaluator.evaluate(predictions)
print(f"Precision: {precision:.4f}")

# Recall
recall_evaluator = MulticlassClassificationEvaluator(
    labelCol="fatality", predictionCol="prediction", metricName="recallByLabel"
)
recall = recall_evaluator.evaluate(predictions)
print(f"Recall: {recall:.4f}")
AUC: 0.7993
Accuracy: 0.6786
Precision: 0.6485
Recall: 0.8455

After training the Random Forest model to predict whether a terrorist incident results in at least one fatality, the model achieved an AUC of 0.7993, indicating good discriminative ability between fatal and non-fatal incidents. The accuracy was 0.6786, showing that roughly 68% of incidents were correctly classified. The precision of 0.6485 indicates that when the model predicts a fatal attack, it is correct about 65% of the time, while the recall of 0.8455 shows that the model successfully identifies around 85% of actual fatal incidents. These results suggest that the model is particularly effective at capturing high-risk incidents, making it a valuable tool for proactive security planning and resource allocation.

In [282]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
import pandas as pd
import matplotlib.pyplot as plt

# Get the trained Random Forest from the pipeline
rf_model = model.stages[-1]

# Get input column names from VectorAssembler
assembler = model.stages[-2]
feature_names = assembler.getInputCols()

# Random Forest feature importances
importances = rf_model.featureImportances.toArray()

# Check lengths
print(len(feature_names), len(importances))  # should match
10 3448
In [283]:
numeric_cols = ['year', 'success', 'suicide', 'wounded']  # your numeric features
feat_imp_df = pd.DataFrame({
    "feature": numeric_cols,
    "importance": importances[-len(numeric_cols):]  # last few importances correspond to numeric
})
In [284]:
feat_imp_df = feat_imp_df.sort_values(by='importance', ascending=False)

plt.figure(figsize=(8,5))
plt.barh(feat_imp_df['feature'], feat_imp_df['importance'], color='skyblue')
plt.xlabel("Feature Importance")
plt.title("Feature Importances (Numeric Features Only)")
plt.gca().invert_yaxis()
plt.show()

Based on the Random Forest model trained to predict whether a terrorist incident results in at least one fatality, the feature importance analysis revealed that success, wounded, and year were the most influential predictors. The success variable, indicating whether the attack achieved its intended objective, had the highest impact, suggesting that successful attacks are far more likely to cause fatalities. The wounded feature also played a significant role, reflecting that incidents causing more injuries are correlated with higher chances of at least one death. Finally, year contributed moderately, implying that temporal trends and changes in tactics or security measures over time affect the likelihood of fatalities. Overall, these results highlight that both the immediate severity of the incident and its broader context are key factors in determining fatal outcomes.

In [285]:
from pyspark.sql import Row

# Create a single-row DataFrame
new_incident = spark.createDataFrame([
    Row(
        attack_type="Bombing/Explosion",
        weapon_type="Explosives",
        target_type="Government",
        region="Middle East & North Africa",
        country="Iraq",
        terror_group="ISIS",
        year=2023,
        success=1,
        suicide=0,
        wounded=5
    )
])
In [286]:
# Make prediction
prediction = model.transform(new_incident)

# Show the result
prediction.select("features", "prediction", "probability").show(truncate=False)
+----------------------------------------+----------+---------------------------------------+
|features                                |prediction|probability                            |
+----------------------------------------+----------+---------------------------------------+
|(3448,[3444,3445,3447],[2023.0,1.0,5.0])|0.0       |[0.5095517837672273,0.4904482162327727]|
+----------------------------------------+----------+---------------------------------------+

Using the trained Random Forest model, I made a prediction for a hypothetical terrorist incident characterized as a bombing/explosion by ISIS in Iraq targeting a government entity in 2023, with 5 wounded, marked as a successful but non-suicide attack. The model predicted a fatality outcome of 0, indicating that this particular incident is unlikely to result in at least one death. The associated probability score further supports this, with a 50.9% chance of no fatalities and a 49.0% chance of at least one fatality, suggesting that while the risk of death is not zero, the model considers the likelihood of fatalities to be relatively low. This demonstrates how the predictive model can be applied to assess potential outcomes for new or hypothetical incidents, aiding in risk assessment and preparedness planning.

Model 02 - to predict whether an attack is a suicide attack based on features like region, attack type, target type, weapon, country, etc.¶

I built a machine learning pipeline using PySpark to predict the likelihood of suicide attacks based on features from the Global Terrorism Database (GTD). I selected nine meaningful predictive features including attack characteristics (type, target, weapon), geographic context (region, country), operational outcomes (success), and composite metrics (severity_score). This feature set was chosen to capture both the tactical and contextual dimensions of suicide attacks while maintaining data completeness. Then filling missing numeric values with zeros, and dropping rows with missing categorical data. I ensured that the target variable, suicide, was correctly cast as an integer. For the categorical features, I applied StringIndexer followed by OneHotEncoder to convert them into numeric vectors. I then assembled all feature columns, including engineered numeric features such as the severity_score, into a single feature vector using VectorAssembler. I implemented a Gradient Boosted Tree (GBT) Classifier within a PySpark pipeline, split the data into training and test sets, trained the model, and generated predictions. Finally, I evaluated the model’s performance using AUC, accuracy, precision, and recall metrics to assess its predictive capability.

In [291]:
# -----------------------------
# 1. Ensure severity_score is calculated in the main DataFrame
# -----------------------------
gtd_df = gtd_df.withColumn(
    "severity_score",
    F.coalesce(F.col("killed").cast("double"), F.lit(0)) * 0.6 +
    F.coalesce(F.col("suicide").cast("double"), F.lit(0)) * 1.5 +
    attack_weight_udf(F.col("attack_type"))
)
In [301]:
# -----------------------------
# 2. Include severity_score in gtd_df_p2
# -----------------------------
gtd_df_p2 = gtd_df.select(
    'region', 'country', 'attack_type', 'target_type', 'weapon_type',
    'success', 'killed', 'wounded', 'severity_score', 'suicide'
)
In [302]:
# -----------------------------
# 3. Handle missing values
# -----------------------------
# Fill missing numeric values with 0
numeric_cols = ['success', 'killed', 'wounded', 'severity_score']
for col_name in numeric_cols:
    gtd_df_p2 = gtd_df_p2.withColumn(
        col_name, when(col(col_name).isNull(), 0).otherwise(col(col_name))
    )

# Drop rows where categorical vars OR label are missing
categorical_cols = ['region', 'country', 'attack_type', 'target_type', 'weapon_type']
gtd_df_p2 = gtd_df_p2.dropna(subset=categorical_cols + ["suicide"])

# Make sure suicide is integer (0 or 1)
gtd_df_p2 = gtd_df_p2.withColumn("suicide", col("suicide").cast(IntegerType()))
In [303]:
# -----------------------------
# 4. Index and encode categorical features
# -----------------------------
stages = []

for cat_col in categorical_cols:
    indexer = StringIndexer(inputCol=cat_col, outputCol=cat_col+"_index", handleInvalid="keep")
    encoder = OneHotEncoder(inputCol=cat_col+"_index", outputCol=cat_col+"_ohe")
    stages += [indexer, encoder]
In [304]:
# -----------------------------
# 5. Assemble features
# -----------------------------
feature_cols = [c+"_ohe" for c in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
stages += [assembler]
In [307]:
# -----------------------------
# 6. Define classifier
# -----------------------------
gbt_classifier = GBTClassifier(labelCol="suicide", featuresCol="features", maxIter=50, maxDepth=5)
stages += [gbt_classifier]
In [308]:
# -----------------------------
# 7. Create pipeline
# -----------------------------
pipeline = Pipeline(stages=stages)
In [309]:
# -----------------------------
# 8. Split data
# -----------------------------
train_df, test_df = gtd_df_p2.randomSplit([0.8, 0.2], seed=42)
In [310]:
# -----------------------------
# 9. Train model
# -----------------------------
suicide_model = pipeline.fit(train_df)
In [311]:
# -----------------------------
# 10. Make predictions
# -----------------------------
predictions = suicide_model.transform(test_df)
predictions.select("features", "prediction", "probability", "suicide").show(5, truncate=False)
+---------------------------------------------------------------+----------+-----------------------------------------+-------+
|features                                                       |prediction|probability                              |suicide|
+---------------------------------------------------------------+----------+-----------------------------------------+-------+
|(294,[11,96,215,233,269,290,293],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0       |[0.9784791144416669,0.02152088555833309] |0      |
|(294,[11,96,216,241,268,293],[1.0,1.0,1.0,1.0,1.0,1.0])        |0.0       |[0.9784791144416686,0.021520885558331426]|0      |
|(294,[11,96,216,234,269,290,293],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0       |[0.9784791144416685,0.021520885558331537]|0      |
|(294,[11,96,214,235,271,290,293],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0       |[0.9784791144416688,0.021520885558331204]|0      |
|(294,[11,96,214,234,268,290,293],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|0.0       |[0.978479114441669,0.021520885558330982] |0      |
+---------------------------------------------------------------+----------+-----------------------------------------+-------+
only showing top 5 rows

In [312]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

# -----------------------------
# 11. Binary Classification Evaluation (AUC)
# -----------------------------
binary_evaluator = BinaryClassificationEvaluator(
    labelCol="suicide", rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
auc = binary_evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")

# -----------------------------
# 12. Multiclass Evaluation (Accuracy, Precision, Recall)
# -----------------------------
multi_evaluator = MulticlassClassificationEvaluator(labelCol="suicide", predictionCol="prediction")

accuracy = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "accuracy"})
precision = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedPrecision"})
recall = multi_evaluator.evaluate(predictions, {multi_evaluator.metricName: "weightedRecall"})

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
AUC: 0.0000
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000

The evaluation of the Gradient Boosted Tree model showed perfect performance metrics, with an accuracy, precision, and recall of 1.0. However, the AUC score was 0.0, indicating that while the model perfectly classified the instances in the test set, it may not be effectively distinguishing between the positive and negative classes in terms of ranking probabilities. This suggests potential issues such as class imbalance or overfitting, which should be further investigated to ensure the model’s reliability in practical scenarios.

In [314]:
# Numeric features only
numeric_features = ['success', 'killed', 'wounded', 'severity_score']
numeric_importances = importances[-len(numeric_features):]  # last features in assembler

# Create DataFrame
importance_df = pd.DataFrame({
    "Feature": numeric_features,
    "Importance": numeric_importances
}).sort_values(by="Importance", ascending=False)

# Plot
plt.figure(figsize=(8,5))
plt.barh(importance_df["Feature"], importance_df["Importance"])
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.title("Feature Importances - Numeric Features (GBT Model)")
plt.gca().invert_yaxis()
plt.show()

After training the Gradient Boosting model to predict suicide incidents, the feature importance analysis revealed that the most influential factors were killed, severity_score, and success. This indicates that the number of fatalities in an incident, the overall severity of the attack, and whether the attack was successful are the key predictors driving the model’s classification of suicide attacks. These features contribute the most to the model’s decision-making process, highlighting their critical role in understanding and forecasting high-risk incidents.

In [317]:
# -------------------------------
# 1. Create a hypothetical case with all required columns
# -------------------------------
hypothetical_data = spark.createDataFrame([
    Row(
        region="Middle East & North Africa",
        country="Iraq",
        attack_type="Bombing/Explosion",
        target_type="Military",
        weapon_type="Explosives",
        success=1,
        killed=5,
        wounded=2,
        severity_score=5.2
    )
])

# -------------------------------
# 2. Transform using trained model
# -------------------------------
predictions = suicide_model.transform(hypothetical_data)

# -------------------------------
# 3. Show predictions
# -------------------------------
predictions.select(
    "region", "country", "attack_type", "target_type", "weapon_type",
    "success", "killed", "wounded", "severity_score",
    "prediction", "probability"
).show(truncate=False)
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+-----------------------------------------+
|region                    |country|attack_type      |target_type|weapon_type|success|killed|wounded|severity_score|prediction|probability                              |
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+-----------------------------------------+
|Middle East & North Africa|Iraq   |Bombing/Explosion|Military   |Explosives |1      |5     |2      |5.2           |0.0       |[0.9784791144416695,0.021520885558330538]|
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+-----------------------------------------+

For the hypothetical scenario, we considered an attack in the Middle East & North Africa region, specifically in Iraq, where the attack type was “Bombing/Explosion” targeting a military site using explosives. The attack was successful, resulting in 5 killed and 2 wounded, with a calculated severity score of 5.2. The trained model predicted that the likelihood of a suicide attack in this case is low, assigning a prediction of 0.0, with a probability distribution of approximately 97.85% for non-suicide and 2.15% for suicide. This indicates that, given the features provided, the model considers it very unlikely to be a suicide attack.

Model 03 - To predict whether an attack was successful or not, i.e., target variable: success (0 = Failed, 1 = Successful)¶

I developed a binary classification model to predict the success outcome of terrorist attacks (successful vs failed). I have selected key features including categorical variables (region, country, attack_type, target_type, weapon_type) and numeric variables (suicide, killed, wounded, severity_score). I have handled missing numeric values by filling them with zeros and dropped rows with missing categorical values to ensure data quality. I have indexed all categorical variables using StringIndexer and applied one-hot encoding using OneHotEncoder, while retaining numeric features as-is. I have assembled the processed features into a single vector using VectorAssembler. I have then built a logistic regression model using a Spark ML Pipeline to classify attacks as successful or failed. I have split the data into training and test sets (80:20), trained the model on the training set, and made predictions on the test set, including the predicted class and the probability of success for each attack.

In [320]:
# -----------------------------
# 1. Select features and handle missing values
# -----------------------------
features = [
    'region', 'country', 'attack_type', 'target_type',
    'weapon_type', 'suicide', 'killed', 'wounded', 'severity_score'
]
target = 'success'

gtd_df_p3 = gtd_df.select(features + [target])
In [329]:
# Fill missing numeric features with 0
numeric_cols = ['suicide', 'killed', 'wounded', 'severity_score']
for col_name in numeric_cols:
    gtd_df_p3 = gtd_df_p3.withColumn(col_name, when(col(col_name).isNull(), 0).otherwise(col(col_name)))
In [330]:
# Drop rows with missing categorical features
categorical_cols = ['region', 'country', 'attack_type', 'target_type', 'weapon_type']
gtd_df_p3 = gtd_df_p3.dropna(subset=categorical_cols)

# Drop rows where target label is null
gtd_df_p3 = gtd_df_p3.dropna(subset=[target])

# Make sure target is integer
gtd_df_p3 = gtd_df_p3.withColumn(target, col(target).cast(IntegerType()))
In [331]:
# -----------------------------
# 2. Index and encode categorical features
# -----------------------------
stages = []
for cat_col in categorical_cols:
    indexer = StringIndexer(inputCol=cat_col, outputCol=cat_col+"_index", handleInvalid="keep")
    encoder = OneHotEncoder(inputCol=cat_col+"_index", outputCol=cat_col+"_ohe")
    stages += [indexer, encoder]
In [332]:
# -----------------------------
# 3. Assemble features
# -----------------------------
feature_cols = [c+"_ohe" for c in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
stages += [assembler]
In [333]:
# -----------------------------
# 4. Define Logistic Regression classifier
# -----------------------------
lr_classifier = LogisticRegression(labelCol=target, featuresCol="features")
stages += [lr_classifier]
In [334]:
# -----------------------------
# 5. Create Pipeline
# -----------------------------
pipeline = Pipeline(stages=stages)
In [335]:
# -----------------------------
# 6. Split data
# -----------------------------
train_df, test_df = gtd_df_p3.randomSplit([0.8, 0.2], seed=42)
In [336]:
# -----------------------------
# 7. Train model
# -----------------------------
success_model = pipeline.fit(train_df)
In [337]:
# -----------------------------
# 8. Make predictions
# -----------------------------
predictions = success_model.transform(test_df)
predictions.select("region", "country", "attack_type", "target_type",
                   "weapon_type", "features", "success", "prediction", "probability").show(5, truncate=False)
+---------------------+---------+-----------------+-----------------------+-----------+-------------------------------------------------------+-------+----------+----------------------------------------+
|region               |country  |attack_type      |target_type            |weapon_type|features                                               |success|prediction|probability                             |
+---------------------+---------+-----------------+-----------------------+-----------+-------------------------------------------------------+-------+----------+----------------------------------------+
|australasia & oceania|australia|armed assault    |police                 |firearms   |(294,[11,96,215,233,269,293],[1.0,1.0,1.0,1.0,1.0,1.0])|1      |1.0       |[0.09045693869913166,0.9095430613008684]|
|australasia & oceania|australia|assassination    |government (diplomatic)|explosives |(294,[11,96,216,241,268,293],[1.0,1.0,1.0,1.0,1.0,1.0])|0      |0.0       |[0.7400806160477506,0.25991938395224945]|
|australasia & oceania|australia|assassination    |government (general)   |firearms   |(294,[11,96,216,234,269,293],[1.0,1.0,1.0,1.0,1.0,1.0])|1      |1.0       |[0.3344637895213265,0.6655362104786735] |
|australasia & oceania|australia|bombing/explosion|business               |incendiary |(294,[11,96,214,235,271,293],[1.0,1.0,1.0,1.0,1.0,1.0])|1      |1.0       |[0.027309846173160054,0.97269015382684] |
|australasia & oceania|australia|bombing/explosion|government (general)   |explosives |(294,[11,96,214,234,268,293],[1.0,1.0,1.0,1.0,1.0,1.0])|1      |1.0       |[0.17255546305698252,0.8274445369430175]|
+---------------------+---------+-----------------+-----------------------+-----------+-------------------------------------------------------+-------+----------+----------------------------------------+
only showing top 5 rows

In [339]:
# -----------------------------
# 9. Evaluate model
# -----------------------------

# AUC
auc_evaluator = BinaryClassificationEvaluator(labelCol=target, rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = auc_evaluator.evaluate(predictions)

print(f"AUC: {auc:.4f}")

# Accuracy
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol=target, predictionCol="prediction", metricName="accuracy"
)
accuracy = accuracy_evaluator.evaluate(predictions)

# Precision
precision_evaluator = MulticlassClassificationEvaluator(
    labelCol=target, predictionCol="prediction", metricName="weightedPrecision"
)
precision = precision_evaluator.evaluate(predictions)

# Recall
recall_evaluator = MulticlassClassificationEvaluator(
    labelCol=target, predictionCol="prediction", metricName="weightedRecall"
)
recall = recall_evaluator.evaluate(predictions)

# AUC
auc_evaluator = BinaryClassificationEvaluator(
    labelCol=target, rawPredictionCol="rawPrediction", metricName="areaUnderROC"
)
auc = auc_evaluator.evaluate(predictions)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")
AUC: 0.8199
Accuracy: 0.9044
Precision: 0.8907
Recall: 0.9044
AUC: 0.8199

The model was evaluated on the test set, achieving an accuracy of 90.44%, precision of 89.07%, recall of 90.44%, and an AUC of 0.8199, indicating that the model performs well in distinguishing between successful and failed attacks.

In [342]:
# Numeric features used in the model
numeric_features = ['suicide', 'killed', 'wounded', 'severity_score']

# Extract coefficients from the trained Logistic Regression model
lr_model = success_model.stages[-1]  # last stage is LogisticRegression
coefficients = lr_model.coefficients.toArray()

# In your assembler, numeric features are appended at the end
numeric_importances = coefficients[-len(numeric_features):]

# Create DataFrame
importance_df = pd.DataFrame({
    "Feature": numeric_features,
    "Coefficient": numeric_importances
}).sort_values(by="Coefficient", ascending=False)

# Plot
plt.figure(figsize=(8,5))
plt.barh(importance_df["Feature"], importance_df["Coefficient"])
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.title("Feature Importance - Numeric Features (Logistic Regression)")
plt.gca().invert_yaxis()
plt.show()

I have analyzed the numeric features of the logistic regression model for predicting whether an attack was successful. The results show that severity score, wounded, and killed are the most important numeric features influencing the prediction. This indicates that attacks with higher severity scores, a greater number of casualties, or more wounded individuals are more likely to be classified as successful in the model. The feature importance plot clearly highlights these three factors as key drivers in determining the success of an attack.

In [343]:
# -------------------------------
# 1. Create a hypothetical case with all required columns
# -------------------------------
hypothetical_data = spark.createDataFrame([
    Row(
        region="Middle East & North Africa",
        country="Iraq",
        attack_type="Bombing/Explosion",
        target_type="Military",
        weapon_type="Explosives",
        suicide=0,
        killed=5,
        wounded=2,
        severity_score=5.2
    )
])

# -------------------------------
# 2. Transform using trained success prediction model
# -------------------------------
predictions = success_model.transform(hypothetical_data)

# -------------------------------
# 3. Show predictions
# -------------------------------
predictions.select(
    "region", "country", "attack_type", "target_type", "weapon_type",
    "suicide", "killed", "wounded", "severity_score",
    "prediction", "probability"
).show(truncate=False)
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+------------------------------------------+
|region                    |country|attack_type      |target_type|weapon_type|suicide|killed|wounded|severity_score|prediction|probability                               |
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+------------------------------------------+
|Middle East & North Africa|Iraq   |Bombing/Explosion|Military   |Explosives |0      |5     |2      |5.2           |1.0       |[0.0030266188339290625,0.9969733811660709]|
+--------------------------+-------+-----------------+-----------+-----------+-------+------+-------+--------------+----------+------------------------------------------+

For the hypothetical case created with a bombing/explosion attack in Iraq targeting the military using explosives, with 5 fatalities, 2 wounded, and a severity score of 5.2, the trained model predicted the incident as a suicide attack with high confidence. The prediction output shows a probability of 99.7% for suicide (class 1) and only 0.3% for non-suicide (class 0). This result highlights that the combination of factors such as high lethality (killed and wounded) and the nature of the attack strongly influences the model’s classification toward identifying suicide-related incidents.